Amazon Redshift
A whirl-wind tour

November 2013
Kel Graham
Data Architect
What is Redshift?
• Amazon product
• Data warehouse service
• Fully managed
• Fast
• Petabyte scale (1PB == 1 billion MB)
• 1/10 the cost of a traditional DW

As the universe expands, the wavelength of radiation from objects moving away
from an observer shifts towards the red end of the electromagnetic spectrum.
Redshift is a consequence of an expanding universe.
Where does Redshift sit within the Amazon database product suite?
• SimpleDB (non-relational): query flexibility; web-services interface; smaller workloads; 1MB response size; 10GB hard limit
• DynamoDB (non-relational, NoSQL service): high availability; high scalability; runs off SSDs; integrates with Redshift; provisioned throughput
• RDS (relational database service; MySQL / Oracle / SQL Server): high availability; referential integrity; DB-dependent feature-set (Multi-AZ); Online Transaction Processing; provisioned throughput
• Redshift (data warehouse service; PostgreSQL base): high availability; cluster architecture; relational database; horizontal scalability (add more nodes); analytics
What differentiates Redshift from, say, a MySQL RDS instance?
• Cluster architecture
• Columnar storage
• Read optimised
• No RI (referential integrity) by design
What differentiates Redshift from, say, a MySQL RDS instance?
(i) Cluster architecture
a) Clients connect via existing protocols (JDBC, ODBC) to the Leader node.
b) The Leader node develops a query plan and may generate and compile C++ code to be executed by the compute nodes.
c) The Leader node distributes work across compute nodes using Distribution Keys (more later).
d) Compute nodes receive work from the Leader node and may transmit data amongst themselves to answer the query.
e) The Leader node aggregates the results and returns them to the client.
f) The Leader node can distribute bulk data loads across compute nodes: I have loaded 3GB of raw data (gzipped to 500MB) on a single node in under 3 minutes.

source: http://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html
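
The flow in steps (c) to (e) can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the function names, the round-robin split and the sample rows are all hypothetical, the "nodes" run one after another rather than in parallel, and nothing here reflects Redshift's actual internals.

```python
from collections import Counter


def compute_node_count(partition):
    """Work done on one compute node: count its local rows per state."""
    return Counter(row["state"] for row in partition)


def leader_count_by_state(rows, num_nodes=3):
    """Toy leader node: split the rows, collect partial counts, merge them."""
    # (c) the leader hands each compute node a share of the rows
    #     (round-robin here, purely for illustration)
    partitions = [rows[i::num_nodes] for i in range(num_nodes)]
    # (d) each compute node answers the query for its own share
    partials = [compute_node_count(p) for p in partitions]
    # (e) the leader aggregates the partial results for the client
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


# Hypothetical sample data: six property rows across three states.
states = ["NSW", "VIC", "NSW", "QLD", "VIC", "NSW"]
properties = [{"id": i, "state": s} for i, s in enumerate(states)]

result = leader_count_by_state(properties)
print(result)
```

The point of the sketch is that no single node ever sees the whole table, yet the merged answer is the same as a single-machine count.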
What differentiates Redshift from, say, a MySQL RDS instance?
(ii) Column-store database
a) Traditional relational databases tend to store data on a tuple-by-tuple (row-by-row) basis.
b) When querying the data, the engine needs to read more blocks of data, discarding much of the data just read, in order to return the columns being queried.
c) A column-store stores each column's values contiguously in the same block.
d) Result: the number of IO operations involved in a query can be significantly reduced, depending on the shape of the data.
source: http://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html
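
A rough back-of-envelope calculation shows why point (d) holds. The sketch below assumes a hypothetical table of one million rows with 25 four-byte columns, uncompressed, and a 4KB block in both layouts; the figures are illustrative, not measurements.

```python
import math


def blocks_read_row_store(num_rows, row_bytes, block_bytes):
    """Row store: whole rows are packed into blocks, so reading one
    column still means reading every block of the table."""
    rows_per_block = block_bytes // row_bytes
    return math.ceil(num_rows / rows_per_block)


def blocks_read_column_store(num_rows, value_bytes, block_bytes):
    """Column store: a block holds values of a single column, so only
    that column's blocks are read."""
    values_per_block = block_bytes // value_bytes
    return math.ceil(num_rows / values_per_block)


# Assumed figures, for illustration only.
NUM_ROWS = 1_000_000
ROW_BYTES = 100      # 25 columns x 4 bytes
VALUE_BYTES = 4      # the single column being queried
BLOCK_BYTES = 4096

row_blocks = blocks_read_row_store(NUM_ROWS, ROW_BYTES, BLOCK_BYTES)
col_blocks = blocks_read_column_store(NUM_ROWS, VALUE_BYTES, BLOCK_BYTES)
print(row_blocks, col_blocks)  # 25000 977
```

Under these assumptions, a query that touches one column out of 25 reads roughly 1/25 of the blocks. A query that selects every column would see no such saving, which is the "shape of the data" caveat.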

What differentiates Redshift from, say, a MySQL RDS instance?
(iii) Optimised for Read performance
a) Contrast block sizes with other databases:
   i) Default MySQL installs on ext3 file-systems sit on 4KB file-system blocks (InnoDB's own default page size is 16KB)
   ii) Default NTFS partitions use 4KB clusters, and SQL Server on NTFS reads and writes 8KB pages
b) Redshift’s focus on Data Warehousing (and hence read optimisation) allows it to use a 1,024KB (1MB) block size
c) Under a column-store architecture each block holds the same kind of data, so datatype-specific compression enables even more data to be stored per block, further reducing disk space and IO
d) Reduced disk space and IO helps improve inter-node data sharing and replication, where compute nodes may redistribute data based on a table’s distribution key (more later on that)

All your blox are
belong to me!
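
To put the block-size contrast into numbers, the sketch below counts how many block reads it takes to scan one column at 4KB versus 1,024KB blocks. The column size (100 million four-byte integers) and the 4:1 compression ratio are assumptions chosen for illustration, not Redshift measurements.

```python
import math


def block_reads(total_bytes, block_kb):
    """Number of blocks needed to hold (and therefore scan) total_bytes."""
    return math.ceil(total_bytes / (block_kb * 1024))


# Assumed column: 100 million 4-byte integers, about 400MB uncompressed.
COLUMN_BYTES = 100_000_000 * 4

reads_4kb = block_reads(COLUMN_BYTES, 4)
reads_1mb = block_reads(COLUMN_BYTES, 1024)
# Assumed 4:1 compression ratio on top of the 1MB block size.
reads_1mb_compressed = block_reads(COLUMN_BYTES // 4, 1024)

print(reads_4kb, reads_1mb, reads_1mb_compressed)  # 97657 382 96
```

Larger blocks suit scans over long runs of a column. They would be a poor fit for an OLTP workload that fetches single rows, which is why the trade-off only works for a read-optimised warehouse.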
What differentiates Redshift from, say, a MySQL RDS instance?
(iv) No referential integrity

Ok, sounds g… WHAT?!!
• No primary key enforcement
• No foreign key enforcement
• No index support
• No sequences
• No user-defined functions
• No stored procedures
• No common table expressions
• No exotic data types: no arrays, JSON, geospatial types, etc.
• No ‘alter column’ syntax: drop and reload

Do tell Redshift about primary keys, foreign keys and column uniqueness. It won’t enforce them, but it will use these hints to better understand queries.
How does Redshift locate data?
The Sort Key
• Each table can have a single Sort Key – a compound key, comprised of 1 to 400 columns from the table
• Redshift will store data on disk in Sort Key order – so think of it as the single clustered index for the table
• Sort keys should be selected based on how the table is used:
  • Columns that are frequently used to join to other tables should be included in the sort key
  • Date and timestamp columns that are used in filtering operations should be included
• Redshift stores metadata about each data block, including the min and max of each column value – using this, Redshift can skip entire blocks when answering a query
• After data loads or inserts, the ANALYZE command should be run
  • ANALYZE updates the table metadata that is used by the query planner – very important for column-based data storage and ongoing query performance
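
The block-skipping idea can be sketched as follows. This is a toy model, assuming a tiny block of 10 values and hypothetical helper names; it only illustrates how per-block min/max metadata lets a range filter on a sorted column avoid most blocks.

```python
def build_block_metadata(sorted_values, block_size):
    """Split a sorted column into blocks and record each block's min and max."""
    blocks = [
        sorted_values[i:i + block_size]
        for i in range(0, len(sorted_values), block_size)
    ]
    return [(block[0], block[-1], block) for block in blocks]


def range_scan(block_metadata, low, high):
    """Return matching values and the number of blocks actually read."""
    hits, blocks_read = [], 0
    for block_min, block_max, block in block_metadata:
        if block_max < low or block_min > high:
            continue  # the min/max metadata proves no row here can match
        blocks_read += 1
        hits.extend(v for v in block if low <= v <= high)
    return hits, blocks_read


# Hypothetical sort-key column: day numbers 1..100, stored in sorted order.
days = list(range(1, 101))
metadata = build_block_metadata(days, block_size=10)

hits, blocks_read = range_scan(metadata, low=35, high=42)
print(hits, blocks_read, len(metadata))
```

Here the filter reads 2 of the 10 blocks. If the same values were stored unsorted, most blocks would have overlapping min/max ranges and few could be skipped, which is why the choice of sort key matters.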
How does Redshift locate data?
The Distribution Key
• Redshift will distribute and replicate data between compute nodes in order to get best use of the parallelism available in the cluster
  • By default, data will be spread evenly across all compute nodes (EVEN distribution)
  • A node is further broken down into slices, with one slice per CPU core
  • Each slice participates in the parallel execution of a job sent from the Leader node, so the even distribution of data across the nodes is vital to ensuring consistent query performance
  • If data is denormalised and does not participate in joins, then an EVEN distribution won’t be problematic
• Alternatively a Distribution key can be provided (KEY distribution)
  • The Distribution key is important, in that it helps define which data is kept together on a given node.
  • The objective is to choose a key that spreads rows evenly across slices while keeping rows that are joined together on the same node, so that they do not have to be moved across the cluster’s nodes at query time
  • Similarly to the Sort Key, the Distribution key is defined on a per-table basis, but unlike a Sort Key, the Distribution Key is comprised of only a single column
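
A small simulation shows the two effects of KEY distribution described above: rows with the same key value always land on the same slice, and a poorly chosen key leaves slices idle. The hash function (`zlib.crc32`), the slice count and the sample table are all assumptions for illustration; Redshift's real hashing is internal and is not reproduced here.

```python
import zlib
from collections import Counter


def slice_for(key_value, num_slices):
    """Map a distribution-key value to a slice (crc32 is a stand-in hash)."""
    return zlib.crc32(str(key_value).encode("utf-8")) % num_slices


def rows_per_slice(rows, dist_column, num_slices):
    """Count how many rows land on each slice for a given distribution column."""
    counts = Counter(slice_for(row[dist_column], num_slices) for row in rows)
    return [counts.get(s, 0) for s in range(num_slices)]


NUM_SLICES = 4  # assumed: e.g. 2 nodes x 2 slices each

# Hypothetical table: 1,000 properties spread over only two states.
properties = [
    {"property_id": i, "state": "NSW" if i % 2 else "VIC"}
    for i in range(1000)
]

by_id = rows_per_slice(properties, "property_id", NUM_SLICES)
by_state = rows_per_slice(properties, "state", NUM_SLICES)

print("rows per slice, key = property_id:", by_id)
print("rows per slice, key = state:      ", by_state)
```

A column with only two distinct values can occupy at most two slices, so at least half of this toy cluster would sit idle. A high-cardinality key that is also used in joins tends to be the better candidate, because matching rows from both tables hash to the same slice.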
What typical RDBMS features does Redshift have?
Features
• Transactions
• Reasonable number of windowing functions
  • Rank, First, Last, Lag, Sum, Nth and so on
• Most types of relational joins
  • Inner, Left, Right, Full, Cross
• Correlated sub-queries are supported, but only where the query planner can decorrelate them for performance (sub-queries during a join are a no-go)
• Views
• Excellent locking and concurrent write capabilities
  • Thanks PostgreSQL!
• Schema management
• Identity columns (auto_increment)

DataTypes (complete list):
• SmallInt
• Integer
• Bigint
• Decimal
• Real
• Double precision
• Boolean
• Char
• Varchar
• Date
• Timestamp
Other features?
• Close integration with S3 and DynamoDB
  • Our test instance was primed from S3:
      COPY <tableName>
      FROM 's3://bucket/file.csv.gz'
      CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
      DELIMITER ','
      IGNOREHEADER AS 1
      GZIP;
• The COPY command is central to the import process: it can load data in parallel, using what it knows about the structure of the target table to assign work to individual compute nodes
• UNLOAD will export data from a Redshift table out to an S3 bucket
• Excellent set of database system tables that allow one to monitor pretty much everything that’s going on:
  • Loads
  • Queries
  • Chatter between compute nodes
  • Sort and distribution keys
Other features (cont)?
• Column compression
  • Each column can have an optionally assigned compression algorithm, including:
    • BYTEDICT: essentially a key-value lookup for up to 256 values. Useful for repeating data, such as “State” in a property record
    • DELTA: stores the initial value of a column as per its data type, and then stores only the difference between each value and the one before it. Very useful for dates
    • RUNLENGTH: stores the value of a column and the number of times the value is repeated. Useful when repeated values are stored consecutively (relevant for sort-key)
    • MOSTLY8/16/32: stores numeric values in a narrower 8-, 16- or 32-bit form where they fit, while outliers are kept at full width
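
The first three encodings are simple enough to sketch in Python. These are minimal illustrations of the general techniques, with hypothetical function names; they are not Redshift's on-disk formats. In particular, the byte-dictionary sketch simply refuses a 257th distinct value, which is a simplification.

```python
def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(run) for run in runs]


def delta_encode(values):
    """Keep the first value, then only the difference from the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(encoded):
    """Rebuild the original values by accumulating the differences."""
    out = []
    for i, d in enumerate(encoded):
        out.append(d if i == 0 else out[-1] + d)
    return out


def byte_dict_encode(values, max_entries=256):
    """Replace each value with a small integer code from a dictionary."""
    dictionary, codes = {}, []
    for v in values:
        if v not in dictionary:
            if len(dictionary) == max_entries:
                # Simplification: this sketch just gives up at this point.
                raise ValueError("too many distinct values for a one-byte code")
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return list(dictionary), codes


states = ["NSW", "NSW", "NSW", "VIC", "VIC", "QLD"]
day_numbers = [15700, 15701, 15701, 15704]  # hypothetical days since an epoch

print(run_length_encode(states))
print(delta_encode(day_numbers))
print(byte_dict_encode(states))
```

Run-length encoding pays off only when equal values sit next to each other, delta encoding when neighbouring values are close together, and the dictionary when the column has few distinct values.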

Other features (cont)?
• Excellent monitoring and management console integration

How well does Redshift perform?
We ran some rudimentary queries over a realistic data set…
“Return the current list of all valid properties within a selected list of states and tell me what the current number of bedrooms, bathrooms, car spaces, land size, floor size and year built is.”

…and found that Redshift outperformed our existing database by a factor of 2.5 to 3.5.

However, this is not a particularly useful comparison, as these were different machines with different hardware specifications.

Correct selection of a SORT KEY in Redshift is vital. Any filtering or joins on a non-sort-key column will result in a (slow) table scan. In our example, this reduced performance by 30%.
Summary

• Relational
• No referential integrity
• Column compression
• Sort key
• Column store
• Fast
• Accessible (JDBC, ODBC, S3)
• Distribution key
• Massive parallelism
• Compute node
• Leader node
• Cluster architecture

A tour of Amazon Redshift

  • 1. Amazon Redshift: a whirlwind tour. November 2013. Kel Graham, Data Architect.
  • 2. What is Redshift?
      • An Amazon product
      • A data warehouse service
      • Fully managed
      • Fast
      • Petabyte scale (1 PB == 1 billion MB)
      • 1/10 the cost of a traditional DW
      • On the name: as the universe expands, the wavelength of radiation from objects moving away from an observer shifts towards the red end of the electromagnetic spectrum. Redshift is a consequence of an expanding universe.
  • 3. Where does Redshift sit within the Amazon database product suite?
      • Non-relational (NoSQL service):
          • SimpleDB: query flexibility; web-services interface; smaller workloads; 1MB response size; 10GB hard limit
          • DynamoDB: high availability; high scalability; runs off SSDs; integrates with Redshift; provisioned throughput
      • Relational database service:
          • RDS (MySQL / Oracle / SQL Server): high availability; referential integrity; DB-dependent feature-set (Multi-AZ); Online Transaction Processing; provisioned throughput
      • Data warehouse service:
          • Redshift (PostgreSQL base): high availability; cluster architecture; relational database; horizontal scalability (add more nodes); analytics
  • 4. What differentiates Redshift from, say, a MySQL RDS instance?
      • Cluster architecture
      • Columnar storage
      • Read optimised
      • No RI (referential integrity) by design
  • 5. What differentiates Redshift from, say, a MySQL RDS instance? (i) Cluster architecture
      a) Clients connect via existing protocols to the Leader Node.
      b) The Leader node develops a query plan and may generate and compile C++ code to be executed by the compute nodes.
      c) The Leader node distributes work across compute nodes using Distribution Keys (more later).
      d) Compute nodes receive work from the Leader node and may transmit data amongst themselves to answer the query.
      e) The Leader aggregates the results and returns them to the client.
      f) The Leader can distribute bulk data loads across compute nodes (I have loaded 3 GB of raw data, gzipped to 500 MB, on a single node in under 3 minutes).
      source: http://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html
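Steps (c) to (e) on slide 5 follow a scatter-gather pattern, which a toy Python model can illustrate. This is only a sketch: the node names, the data, and the choice of an average as the query are invented for illustration, and nothing here reflects how Redshift is actually implemented.

```python
# Toy scatter-gather model: each compute node aggregates its own rows,
# and the leader merges the partial results. All names are hypothetical.

def compute_node_partial(rows):
    """A compute node returns a partial aggregate over the rows it holds."""
    return {"sum": sum(rows), "count": len(rows)}


def leader_average(node_data):
    """The leader collects the partial aggregates and merges them."""
    partials = [compute_node_partial(rows) for rows in node_data.values()]
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    return total / count if count else None


# Made-up cluster: three compute nodes holding uneven shares of one column.
cluster = {
    "node-0": [10, 20, 30],
    "node-1": [40, 50],
    "node-2": [60],
}

average = leader_average(cluster)
print(average)
```

The point of the sketch is that only small partial aggregates travel back to the leader, not the underlying rows.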
  • 6. What differentiates Redshift from, say, a MySQL RDS instance? (ii) Column-store database
      a) Relational databases tend to store data on a tuple-by-tuple basis.
      b) When querying the data, the engine needs to read more blocks of data, discarding much of the data just read in order to return the columns being queried.
      c) A column-store stores columns contiguously in the same block.
      d) Result: the number of IO operations involved in a query can be significantly reduced, dependent on the shape of the data.
      source: http://docs.aws.amazon.com/redshift/latest/dg/c_columnar_storage_disk_mem_mgmnt.html
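The IO saving described on slide 6 can be estimated with some simple arithmetic. The sketch below assumes fixed-width columns, no compression, and a hypothetical table shape (1,000,000 rows, 50 columns of 8 bytes, 4 KB blocks); none of those figures come from the deck.

```python
import math


def blocks_read_row_store(n_rows, row_bytes, block_bytes):
    """Row store: blocks hold whole rows, so a full scan reads all of them."""
    rows_per_block = block_bytes // row_bytes
    return math.ceil(n_rows / rows_per_block)


def blocks_read_column_store(n_rows, col_bytes, block_bytes, n_cols_queried):
    """Column store: only the blocks of the queried columns are read."""
    values_per_block = block_bytes // col_bytes
    return n_cols_queried * math.ceil(n_rows / values_per_block)


# Hypothetical table shape, chosen only to make the arithmetic visible.
n_rows, n_cols, col_bytes, block_bytes = 1_000_000, 50, 8, 4096

row_blocks = blocks_read_row_store(n_rows, n_cols * col_bytes, block_bytes)
col_blocks = blocks_read_column_store(n_rows, col_bytes, block_bytes,
                                      n_cols_queried=2)
print(row_blocks, col_blocks)
```

Under these assumptions a query touching 2 of the 50 columns reads far fewer blocks from the column layout; the real ratio depends on the shape of the data, as the slide notes.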
  • 7. What differentiates Redshift from, say, a MySQL RDS instance? (iii) Optimised for read performance
      a) Contrast block sizes with other databases:
          • Default MySQL installs on ext3 file-systems use 4k blocks
          • Default NTFS partitions use 4k blocks, so SQL Server on NTFS defaults to 4k blocks as well
      b) Redshift’s focus on data warehousing (and hence read optimisation) allows it to use a 1,024KB block size
      c) Under a column-store architecture each block holds the same kind of data, so datatype-specific compression enables even more data to be stored per block, further reducing disk space and IO
      d) Reduced disk space and IO helps improve inter-node data sharing and replication, where compute nodes may redistribute data based on a table’s distribution key (more on that later)
      (Slide caption: All your blox are belong to me!)
  • 8. What differentiates Redshift from, say, a MySQL RDS instance? (iv) No referential integrity
      Ok, sounds g… WHAT?!!
      • No primary key
      • No foreign key
      • No index support
      • No sequences
      • No user defined functions
      • No stored procedures
      • No common table expressions
      • No exotic data types: no arrays, JSON, geospatial types, etc.
      • No ‘alter column’ syntax: drop and reload
      Do tell Redshift about primary keys, foreign keys and column uniqueness. It won’t enforce them, but it will use these hints to better understand queries.
  • 9. How does Redshift locate data? The Sort Key
      • Each table can have a single Sort Key: a compound key comprising 1 to 400 columns from the table
      • Redshift will store data on disk in Sort Key order, so think of it as the single clustered index for the table
      • Sort keys should be selected based on how the table is used:
          • Columns that are frequently used to join to other tables should be included in the sort key
          • Date and timestamp columns that are used in filtering operations should be included
      • Redshift stores metadata about each data block, including the min and max of each column value; using this, Redshift can skip entire blocks when answering a query
      • After data loads or inserts, the ANALYZE command should be run
          • ANALYZE updates the table metadata that is used by the query planner, which is very important for column-based data storage and ongoing query performance
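The block-skipping idea on slide 9 (per-block min and max values over data stored in sort order) can be sketched in a few lines of Python. The block size, the data, and the function names are assumptions made for illustration; this is a minimal model of the technique, not Redshift's storage format.

```python
def build_zone_map(sorted_values, block_size):
    """Split sorted values into blocks and record (min, max) for each block."""
    blocks = [sorted_values[i:i + block_size]
              for i in range(0, len(sorted_values), block_size)]
    zone_map = [(min(b), max(b)) for b in blocks]
    return blocks, zone_map


def range_scan(blocks, zone_map, lo, hi):
    """Return the matching values and how many blocks had to be read."""
    hits, blocks_read = [], 0
    for block, (block_min, block_max) in zip(blocks, zone_map):
        if block_max < lo or block_min > hi:
            continue  # the min/max metadata proves nothing here can match
        blocks_read += 1
        hits.extend(v for v in block if lo <= v <= hi)
    return hits, blocks_read


values = list(range(1, 101))  # already in sort-key order
blocks, zone_map = build_zone_map(values, block_size=10)
hits, blocks_read = range_scan(blocks, zone_map, lo=42, hi=47)
print(hits, blocks_read, "of", len(blocks), "blocks read")
```

Because the values are stored in sort order, the filter range falls inside a single block and the other nine are skipped without being read. On unsorted data the min/max ranges of the blocks overlap, so far fewer blocks can be ruled out.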
  • 10. How does Redshift locate data? The Distribution Key
      • Redshift will distribute and replicate data between compute nodes in order to get best use of the parallelism available in the cluster
      • By default, data will be spread evenly across all compute nodes (EVEN distribution)
          • A node is further broken down into slices: one slice per CPU core
          • Each slice participates in the parallel execution of a job sent from the Leader node, so the even distribution of data across the nodes is vital to ensuring consistent query performance
          • If data is denormalised and does not participate in joins, then an EVEN distribution won’t be problematic
      • Alternatively, a Distribution key can be provided (KEY distribution)
          • The Distribution key is important, in that it helps define which data is kept together on a given node
          • The objective is to choose a key that helps distribute data across a node’s slices, but not across the cluster’s nodes
          • As with the Sort Key, the Distribution key is defined on a per-table basis, but unlike a Sort Key, the Distribution Key comprises only a single column
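KEY distribution from slide 10 can be modelled as hashing the distribution-key value to pick a slice. In the sketch below, CRC32 is a stand-in hash (the function Redshift actually uses is not assumed), and the tables, column names, and slice count are hypothetical.

```python
import zlib
from collections import defaultdict


def slice_for(key, n_slices):
    """Map a distribution-key value to a slice using a stable stand-in hash."""
    return zlib.crc32(str(key).encode("utf-8")) % n_slices


def distribute(rows, key_column, n_slices):
    """Place each row on the slice chosen by its distribution-key value."""
    slices = defaultdict(list)
    for row in rows:
        slices[slice_for(row[key_column], n_slices)].append(row)
    return slices


# Hypothetical tables that share property_id as their distribution key.
properties = [{"property_id": i, "state": s}
              for i, s in enumerate(["NSW", "VIC", "QLD", "NSW", "VIC", "WA"])]
sales = [{"sale_id": 100 + i, "property_id": pid}
         for i, pid in enumerate([0, 0, 3, 5, 2])]

N_SLICES = 4
prop_slices = distribute(properties, "property_id", N_SLICES)
sale_slices = distribute(sales, "property_id", N_SLICES)

# Check that every sale sits on the same slice as the property it refers to.
property_slice = {p["property_id"]: s
                  for s, rows in prop_slices.items() for p in rows}
colocated = all(property_slice[sale["property_id"]] == s
                for s, rows in sale_slices.items() for sale in rows)
print(colocated)
```

Because both tables hash the same column, matching rows always share a slice, so a join on that column needs no data movement between nodes. A key with few distinct values, or one that is heavily skewed, would leave some slices with far more rows than others.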
  • 11. What typical RDBMS features does Redshift have?
      Features:
      • Transactions
      • Reasonable number of windowing functions
          • Rank, First, Last, Lag, Sum, Nth and so on
      • Most types of relational joins
          • Inner, Left, Right, Full, Cross
      • Correlated sub-queries are supported, but only where the query planner can decorrelate them for performance (sub-queries during a join are a no-go)
      • Views
      • Excellent locking and concurrent write capabilities
          • Thanks PostgreSQL!
      • Schema management
      • Identity columns (auto_increment)
      DataTypes (complete list):
      • SmallInt
      • Integer
      • Bigint
      • Decimal
      • Real
      • Double precision
      • Boolean
      • Char
      • Varchar
      • Date
      • Timestamp
  • 12. Other features?
      • Close integration with S3 and DynamoDB
          • Our test instance was primed from S3: COPY <tableName> from s3://bucket/file.csv.gz header as 1 GZIP
          • The COPY command is central to the import process: it can load data in parallel, using what it knows about the structure of the target table to assign work to individual compute nodes
          • UNLOAD will export data from a Redshift table out to an S3 bucket
      • Excellent set of database system tables that allow one to monitor pretty much everything that’s going on:
          • Loads
          • Queries
          • Chatter between compute nodes
          • Sort and distribution keys
  • 13. Other features (cont)?
      • Column compression
      • Each column can have an optionally assigned compression algorithm, including:
          • BYTEDICT: essentially a key-value lookup for up to 256 values. Useful for repeating data, such as “State” in a property record
          • DELTA: stores the initial value of a column as per its data type, and then stores only the difference between each value and the one before it. Very useful for dates
          • RUNLENGTH: stores the value of a column and the number of times the value is repeated. Useful when repeated values are stored consecutively, which is relevant for sort-key columns
          • MOSTLY8/16/32: uses traditional numeric compression, but allows for outliers
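Three of the encodings on slide 13 are simple enough to sketch in Python. These are textbook versions written for illustration, not Redshift's on-disk formats. The delta sketch stores differences between consecutive values, and the dictionary sketch simply raises once it sees more than 256 distinct values, which is an assumption of this example rather than documented behaviour.

```python
def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def delta_encode(values):
    """Keep the first value, then each difference from the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def byte_dict_encode(values, max_entries=256):
    """Replace each value with a small index into a dictionary of distinct values."""
    dictionary, index, codes = [], {}, []
    for v in values:
        if v not in index:
            if len(dictionary) >= max_entries:
                raise ValueError("too many distinct values for a one-byte code")
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


states = ["NSW", "NSW", "NSW", "VIC", "VIC", "QLD"]  # sorted, so runs are long
days = [15000, 15001, 15001, 15004]                  # dates as day numbers

rle = run_length_encode(states)
deltas = delta_encode(days)
dictionary, codes = byte_dict_encode(states)
print(rle, deltas, dictionary, codes)
```

Run-length encoding pays off here only because equal values sit next to each other, which is why it pairs well with sort-key columns.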
  • 14. Other features (cont)?
      • Excellent monitoring and management console integration
  • 15. How well does Redshift perform?
      • We ran some rudimentary queries over a realistic data set: “Return the current list of all valid properties within a selected list of states and tell me what the current number of bedrooms, bathrooms, car spaces, land size, floor size and year built is.”
      • We found that Redshift outperformed our existing database by a factor of 2.5 to 3.5.
      • Correct selection of a SORT KEY in Redshift is vital. Any filtering or joins on a non-sort-key column will result in a (slow) table scan. In our example, this reduced performance by 30%.
      • However, this is not a particularly useful comparison, as these were different machines with different hardware specifications.
  • 16. Summary: Redshift
      • Relational
      • No referential integrity
      • Column store
      • Column compression
      • Sort key
      • Distribution Key
      • Fast
      • Accessible (JDBC, ODBC, S3)
      • Massive parallelism
      • Cluster architecture: Leader Node, Compute Node