Amazon Redshift 
Jeff Patti
What is Redshift? 
“Redshift is a fast, fully managed, petabyte-scale 
data warehouse service” 
-Amazon 
With Redshift, Monetate is able to generate all of our 
analytics data for a day in ~2 hours. 
The process consumes billions of rows and yields millions.
What isn’t Redshift? 
warehouse=# insert into fact_page_view values 
warehouse-# ('2014-10-02', 1, '2014-10-02 18:30', 2, 3, 4); 
INSERT 0 1 
Time: 4600.094 ms 
warehouse=# select fact_time from fact_page_view 
warehouse-# where fact_date = '2014-10-02'; 
fact_time 
--------------------- 
2014-10-02 18:30:00 
(1 row) 
Time: 618.303 ms
Who am I? 
Jeff Patti 
jeffpatti@gmail.com 
Backend Engineer at Monetate 
Monetate was in the Redshift beta in late 2012 
and has been actively developing on it since. 
We’re hiring - monetate.com/jobs/
Leaving Hive For Redshift 
Hive 
● Unusual failure modes 
● Slower and pricier than Redshift, at least in our configuration 
● Custom query language 
○ Didn’t play nicely with our SQL libraries 
Redshift 
● Fully Managed 
● Performant & Scalable 
● Excellent integration with other AWS offerings 
● PostgreSQL interface 
○ command line interface 
○ libraries for PostgreSQL work against Redshift
Fully Managed 
● Easy to deploy 
● Easy to scale out 
● Software updates - handled 
● Hardware failures - taken care of 
● Automatic backups - baked in
Automatic Backups 
● Periodically taken as delta from prior backup 
● Easy to create new cluster from backup, or 
overwrite existing cluster 
● Queryable during recovery, after short delay 
○ Preferentially recovers needed blocks to perform 
commands 
● This is how Monetate keeps our 
development cluster in sync with production
Maintenance Window 
● Required half hour window once a week for 
routine maintenance, such as software 
updates 
● During this time the cluster is unresponsive 
● You pick when it happens
Scaling Out 
You: Change cluster size through AWS console 
AWS: 
1. Existing cluster put into read only state 
2. New cluster caught up with existing cluster 
3. Swapped during maintenance window, 
unless specified as immediate 
a. Immediate swap causes temporary unavailability 
during canonical name record swap (a few minutes)
Monetate 
● Core products are merchandising, web & 
email personalization, testing 
● A/B & Multivariate testing to determine 
impact of experiments 
● Involved with >20% of US ecommerce spend 
each holiday season for the past 3 years 
running
Monetate Data Collection 
To compute analytics and reports on our clients’ 
experiments, we collect a lot of data 
● Billions of page views a week 
● Billions of experiment views a week 
● Millions of purchases a week 
● etc. 
This is where Redshift comes in handy
Redshift In Monetate 
[Architecture diagram. Labels: App (x5); Monetate is Multi-region & Multi-AZ in AWS; Amazon S3; Amazon Redshift; Our Clients; Data Warehousing; Analytics & Reporting]
Under The Covers 
● Fork of PostgreSQL 8.0.2, get nice things like 
○ Common Table Expressions 
○ Window Functions 
● Column oriented database 
● Clusters can have many machines 
○ Each machine has many slices 
○ Queries run in parallel on all slices 
● Concurrent query support & memory limiting
Instance Types
Query Concurrency
Example Redshift Table 
CREATE TABLE fact_url ( 
fact_date DATE NOT NULL ENCODE lzo, 
account_id INT NOT NULL ENCODE lzo, 
fact_time TIMESTAMP NOT NULL ENCODE lzo, 
mid BIGINT NOT NULL ENCODE lzo, 
uri VARCHAR(2048) ENCODE lzo, 
referer_uri VARCHAR(2048) ENCODE lzo, 
PRIMARY KEY (account_id, fact_time, mid) 
) 
DISTKEY (mid) 
SORTKEY (fact_date, account_id, fact_time, mid);
Per Column Compression 
● Used to fit more rows in each 1MB block 
● Trade off between CPU and IO 
● Allows Redshift to read rows from disk faster 
● Has to use more CPU to decompress data 
● Our Redshift queries are IO bound 
○ We use compression extensively
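A rough way to see the CPU-for-IO trade is to compress a repetitive column and count how many 1MB blocks it needs. The sketch below is only an illustration: it uses Python's zlib as a stand-in codec (not Redshift's lzo encoding), and the uri values are made up.

```python
import zlib

BLOCK_SIZE = 1024 * 1024  # column data is stored in 1MB blocks


def blocks_needed(num_bytes, block_size=BLOCK_SIZE):
    """Number of fixed-size blocks needed to hold num_bytes (ceiling division)."""
    return -(-num_bytes // block_size)


# Hypothetical uri column: a few distinct values repeated many times.
uris = ["/home", "/cart", "/checkout", "/product/123"]
column = "\n".join(uris[i % len(uris)] for i in range(500_000)).encode()

compressed = zlib.compress(column)
raw_blocks = blocks_needed(len(column))
compressed_blocks = blocks_needed(len(compressed))

# Fewer blocks means less IO per scan; the price is CPU spent decompressing.
assert zlib.decompress(compressed) == column
print(raw_blocks, compressed_blocks)
```

Real columns are less repetitive than this one, so the ratio here is optimistic; the point is only that block count, and therefore IO, falls as the data compresses.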
Constraints 
“Uniqueness, primary key, and foreign key 
constraints are informational only; they are not 
enforced by Amazon Redshift.” 
However, “If your application allows invalid 
foreign keys or primary keys, some queries 
could return incorrect results.” [emphasis added]
Distribution Style 
Controls how Redshift distributes rows 
● Styles 
○ Even - round robin rows (default) 
○ Key - data with the same key goes to same slice 
■ Based on a single column from the table 
○ All - data is copied to every node 
■ Good for small tables
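The three styles can be sketched as plain functions that assign rows to slices. This is a toy model under stated assumptions: crc32 stands in for Redshift's internal hash, the rows are invented, and ALL is modeled as a full copy per slice for simplicity.

```python
import zlib
from collections import defaultdict
from itertools import cycle


def distribute_even(rows, num_slices):
    """EVEN: deal rows out round robin, regardless of content."""
    slices = defaultdict(list)
    for slice_id, row in zip(cycle(range(num_slices)), rows):
        slices[slice_id].append(row)
    return slices


def distribute_key(rows, num_slices, key):
    """KEY: rows with the same value in the key column land on the same slice."""
    slices = defaultdict(list)
    for row in rows:
        # crc32 is a stand-in; the real hash function is internal to Redshift.
        slice_id = zlib.crc32(str(row[key]).encode()) % num_slices
        slices[slice_id].append(row)
    return slices


def distribute_all(rows, num_slices):
    """ALL: a full copy everywhere (modeled per slice here for simplicity)."""
    return {slice_id: list(rows) for slice_id in range(num_slices)}


# Hypothetical rows keyed by mid, the DISTKEY of the example table.
rows = [{"mid": mid, "uri": f"/page/{n}"} for n, mid in enumerate([7, 7, 8, 9, 9, 9])]
by_even = distribute_even(rows, num_slices=4)
by_key = distribute_key(rows, num_slices=4, key="mid")
by_all = distribute_all(rows, num_slices=4)
```

With KEY, a skewed key column (many rows sharing one mid) produces uneven slices, which is the usual reason to fall back to EVEN.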
DISTKEY impacts Joins 
DS_DIST_NONE 
No redistribution is required, because 
corresponding slices are collocated on the 
compute nodes. You will typically have only one 
DS_DIST_NONE step, the join between the fact 
table and one dimension table. 
DS_DIST_ALL_NONE 
No redistribution is required, because the inner 
join table used DISTSTYLE ALL. The entire 
table is located on every node. 
These two are very performant 
DS_DIST_INNER 
The inner table is redistributed. 
DS_BCAST_INNER 
A copy of the entire inner table is broadcast to all 
the compute nodes. 
DS_DIST_ALL_INNER 
The entire inner table is redistributed to a single 
slice because the outer table uses DISTSTYLE 
ALL. 
DS_DIST_BOTH 
Both tables are redistributed.
Query Plan From Explain 
-> XN Hash Join DS_DIST_ALL_NONE (cost=112.50..14142.59 rows=170771 width=84) 
   Hash Cond: ("outer".venueid = "inner".venueid) 
   -> XN Hash Join DS_DIST_ALL_NONE (cost=109.98..10276.71 rows=172456 width=47) 
      Hash Cond: ("outer".eventid = "inner".eventid) 
      -> XN Merge Join DS_DIST_NONE (cost=0.00..6286.47 rows=172456 width=30) 
         Merge Cond: ("outer".listid = "inner".listid) 
         -> XN Seq Scan on listing (cost=0.00..1924.97 rows=192497 width=14) 
         -> XN Seq Scan on sales (cost=0.00..1724.56 rows=172456 width=24)
Sort Key 
● Data is stored on disk in sorted order 
○ After being inserted into an empty table, or vacuumed 
● Sort Key impacts vacuum performance 
● Columnar data stored in 1MB blocks 
○ min/max data stored as metadata 
● Metadata used to improve query performance 
○ Allows Redshift to skip unnecessary blocks
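Block skipping from min/max metadata can be sketched in a few lines. This is a simplified model, not Redshift's implementation: a "block" here is a fixed number of values rather than 1MB, and the column contents are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    values: list
    # Per-block metadata, filled in when the block is built.
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self):
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def make_blocks(values, rows_per_block):
    """Split a column into consecutive fixed-size blocks."""
    return [
        Block(values[i:i + rows_per_block])
        for i in range(0, len(values), rows_per_block)
    ]


def scan_equal(blocks, target):
    """Return (matching values, blocks actually read) for value == target."""
    matches, blocks_read = [], 0
    for block in blocks:
        if target < block.min_value or target > block.max_value:
            continue  # metadata proves this block cannot contain target
        blocks_read += 1
        matches.extend(v for v in block.values if v == target)
    return matches, blocks_read


values = [(i * 37) % 100 for i in range(1000)]  # hypothetical unsorted column
unsorted_blocks = make_blocks(values, rows_per_block=100)
sorted_blocks = make_blocks(sorted(values), rows_per_block=100)

unsorted_result = scan_equal(unsorted_blocks, 42)
sorted_result = scan_equal(sorted_blocks, 42)
```

On the unsorted column every block spans the full value range, so nothing can be skipped; once sorted, only the block whose range covers the target is read.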
Sort Key Take 1 
SORTKEY (account_id, fact_time, mid) 
● As we added new facts, bad things started happening 
● Resorting rows for vacuuming had to reorder almost all the rows :( 
● This made vacuuming unreasonably slow, affecting how often we could 
vacuum and therefore query performance 
[Diagram: existing rows laid out as account 1 (time ordered), account 2 (time ordered), ... account n (time ordered); new facts for all accounts arrive; the vacuumed table is again account 1 (time ordered), account 2 (time ordered), ... account n (time ordered)]
Sort Key Take 2 
SORTKEY (fact_time, account_id, mid) 
● Now our table was like an append only log, but it had poor query performance 
● For many of our queries, we only look at one account at a time 
● Redshift blocks are 1MB each, and each spanned many accounts 
● When querying a single account, we had to read from disk and ignore many 
rows from other accounts 
[Diagram: rows laid out as 00:00 (account ordered), 00:01 (account ordered), ... Now (account ordered)]
Sort Key Take 3 
SORTKEY (fact_date, account_id, fact_time, mid) 
● Append only log ✓ 
○ Cheap vacuuming ✓ 
● Single or few accounts per block ✓ 
○ Significantly improved query performance ✓ 
[Diagram: rows laid out as Jan 1st (account ordered), Jan 2nd (account ordered), ... Today (account ordered)]
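The three takes can be compared on a toy table. The sketch below is an illustration under assumptions, not a measurement: the account, day and minute ranges are invented, a block holds 6 rows instead of 1MB, mid is left out of the keys, and "rows moved" simply counts earlier rows whose position changes once a new day is merged in.

```python
from itertools import product

ACCOUNTS = range(4)   # hypothetical account ids
DAYS = range(3)       # hypothetical fact_date values
MINUTES = range(6)    # hypothetical minutes within a day
ROWS_PER_BLOCK = 6    # stand-in for a 1MB block

# Each fact is (fact_date, account_id, fact_time); fact_time is (day, minute).
facts = [
    (day, acct, (day, minute))
    for day, acct, minute in product(DAYS, ACCOUNTS, MINUTES)
]

SORT_KEYS = {
    "take1": lambda f: (f[1], f[2]),        # account_id, fact_time
    "take2": lambda f: (f[2], f[1]),        # fact_time, account_id
    "take3": lambda f: (f[0], f[1], f[2]),  # fact_date, account_id, fact_time
}


def blocks_touched(rows, account):
    """Blocks containing at least one row for the given account."""
    blocks = [rows[i:i + ROWS_PER_BLOCK] for i in range(0, len(rows), ROWS_PER_BLOCK)]
    return sum(any(r[1] == account for r in block) for block in blocks)


def rows_moved_by_new_day(key):
    """Earlier rows whose position changes once the last day is merged in."""
    last_day = max(DAYS)
    old = sorted((f for f in facts if f[0] != last_day), key=key)
    merged = sorted(facts, key=key)
    return sum(1 for i, row in enumerate(old) if merged[i] != row)


results = {
    name: (blocks_touched(sorted(facts, key=key), account=0), rows_moved_by_new_day(key))
    for name, key in SORT_KEYS.items()
}
for name, (touched, moved) in results.items():
    print(name, "blocks touched:", touched, "rows moved:", moved)
```

In this toy setup Take 1 reads few blocks but shifts most existing rows on every load, Take 2 shifts nothing but touches every block, and Take 3 gets both properties.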
Redshift ⇔ S3 
Redshift & S3 have excellent integration 
● Unload from Redshift to S3 via UNLOAD 
○ Each slice unloads separately to S3 
○ We unload into a CSV format 
● Load into Redshift from S3 via COPY 
○ Applies all as inserts 
○ Primary keys aren’t enforced by Redshift 
■ Use staging table to detect duplicate keys
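One way to use a staging table is to load into it first and then copy across only rows whose key is not already in the target. The sketch below shows that pattern with Python's built-in sqlite3 as a stand-in database; the staging table name and sample rows are hypothetical, the slides do not say this is exactly how Monetate does it, and it does not handle duplicates within the staging table itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- No PRIMARY KEY constraint, mirroring a database that does not enforce one.
    CREATE TABLE fact_url (account_id INT, fact_time TEXT, mid INT, uri TEXT);
    CREATE TABLE fact_url_staging (account_id INT, fact_time TEXT, mid INT, uri TEXT);
""")
conn.execute("INSERT INTO fact_url VALUES (1, '2014-10-02 18:30:00', 2, '/home')")
conn.executemany(
    "INSERT INTO fact_url_staging VALUES (?, ?, ?, ?)",
    [
        (1, "2014-10-02 18:30:00", 2, "/home"),  # key already in the target
        (1, "2014-10-02 18:31:00", 2, "/cart"),  # new key
    ],
)

# Copy across only the staged rows whose key is not already present.
conn.execute("""
    INSERT INTO fact_url
    SELECT s.account_id, s.fact_time, s.mid, s.uri
    FROM fact_url_staging s
    LEFT JOIN fact_url f
      ON f.account_id = s.account_id
     AND f.fact_time = s.fact_time
     AND f.mid = s.mid
    WHERE f.account_id IS NULL
""")

row_count = conn.execute("SELECT COUNT(*) FROM fact_url").fetchone()[0]
uris = [r[0] for r in conn.execute("SELECT uri FROM fact_url ORDER BY fact_time")]
```

The same anti-join shape (LEFT JOIN plus IS NULL on the key) is standard SQL, so it should carry over to a real staging table with little change.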
Redshift UNLOAD 
unload ('select * from venue order by venueid') 
to 's3://mybucket/tickit/venue/reload_' 
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>' 
manifest 
delimiter ',';
Redshift UNLOAD Tip 
unload ('select * from venue order by venueid') 
● The query in unload is itself a quoted string, which wreaks havoc with 
quotes around dates, e.g. fact_time <= '2014-10-02' 
● Instead of escaping the quotes around the datetimes, dollar-quote the query 
○ unload ($$ select * from venue order by venueid $$)
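When the statement is assembled in application code, the dollar-quoting tip can be wrapped in a small helper. This is a minimal sketch: build_unload is a hypothetical function, not part of any library, and the bucket and credential placeholders are copied from the example slide.

```python
def build_unload(query, s3_prefix, credentials):
    """Wrap a query in $$ ... $$ so single quotes inside it need no escaping."""
    if "$$" in query:
        raise ValueError("query already contains $$ and cannot be dollar quoted as-is")
    return (
        f"unload ($$ {query} $$)\n"
        f"to '{s3_prefix}'\n"
        f"credentials '{credentials}'\n"
        "manifest\n"
        "delimiter ',';"
    )


statement = build_unload(
    "select * from fact_page_view where fact_time <= '2014-10-02'",
    "s3://mybucket/tickit/venue/reload_",
    "aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>",
)
print(statement)
```

The date literal passes through untouched, which is the whole benefit over doubling or backslash-escaping every quote in the inner query.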
Redshift COPY 
copy venue 
from 's3://mybucket/tickit/venue/reload_manifest' 
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>' 
manifest 
delimiter ',';
Try it Yourself! For Free!!! 
Amazon Redshift documentation is well written 
It contains great tutorials with pricing estimates 
Amazon offers a 750-hour free trial of Redshift 
DW2.Large nodes
Questions?

Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 

Amazon Redshift

  • 2. What is Redshift? “Redshift is a fast, fully managed, petabyte-scale data warehouse service” -Amazon With Redshift Monetate is able to generate all of our analytics data for a day in ~ 2 hours A process that consumes billions of rows and yields millions
  • 3. What isn’t Redshift?
    warehouse=# insert into fact_page_view values
    warehouse-# ('2014-10-02', 1, '2014-10-02 18:30', 2, 3, 4);
    INSERT 0 1
    Time: 4600.094 ms
    warehouse=# select fact_time from fact_page_view
    warehouse-# where fact_date = '2014-10-02';
          fact_time
    ---------------------
     2014-10-02 18:30:00
    (1 row)
    Time: 618.303 ms
  • 4. Who am I? Jeff Patti, jeffpatti@gmail.com, Backend Engineer at Monetate. Monetate was in Redshift's beta in late 2012 and has been actively developing on it since. We’re hiring - monetate.com/jobs/
  • 5. Leaving Hive For Redshift
    Hive:
    ● Unusual failure modes
    ● Slower and pricier than Redshift, at least in our configuration
    ● Custom query language
      ○ Didn’t play nicely with our SQL libraries
    Redshift:
    ● Fully managed
    ● Performant & scalable
    ● Excellent integration with other AWS offerings
    ● PostgreSQL interface
      ○ Command line interface
      ○ Libraries for PostgreSQL work against Redshift
  • 6. Fully Managed ● Easy to deploy ● Easy to scale out ● Software updates - handled ● Hardware failures - taken care of ● Automatic backups - baked in
  • 12. Automatic Backups ● Periodically taken as delta from prior backup ● Easy to create new cluster from backup, or overwrite existing cluster ● Queryable during recovery, after short delay ○ Preferentially recovers needed blocks to perform commands ● This is how Monetate keeps our development cluster in sync with production
  • 14. Maintenance Window ● Required half hour window once a week for routine maintenance, such as software updates ● During this time the cluster is unresponsive ● You pick when it happens
  • 15. Scaling Out
    You: Change cluster size through AWS console
    AWS:
    1. Existing cluster put into read only state
    2. New cluster caught up with existing cluster
    3. Swapped during maintenance window, unless specified as immediate
       a. Immediate swap causes temporary unavailability during canonical name record swap (a few minutes)
  • 16. Monetate ● Core products are merchandising, web & email personalization, testing ● A/B & Multivariate testing to determine impact of experiments ● Involved with >20% of US ecommerce spend each holiday season for the past 3 years running
  • 17. Monetate Data Collection
    To compute analytics and reports on our clients' experiments, we collect a lot of data
    ● Billions of page views a week
    ● Billions of experiment views a week
    ● Millions of purchases a week
    ● etc.
    This is where Redshift comes in handy
  • 18. Redshift In Monetate
    Monetate is Multi-region & Multi-AZ in AWS
    [Diagram: App (x5), Amazon S3, Amazon Redshift, Our Clients; labels: Data Warehousing, Analytics & Reporting]
  • 19. Under The Covers ● Fork of PostgreSQL 8.0.2, get nice things like ○ Common Table Expressions ○ Window Functions ● Column oriented database ● Clusters can have many machines ○ Each machine has many slices ○ Queries run in parallel on all slices ● Concurrent query support & memory limiting
  • 22. Example Redshift Table
    CREATE TABLE fact_url (
        fact_date   DATE NOT NULL ENCODE lzo,
        account_id  INT NOT NULL ENCODE lzo,
        fact_time   TIMESTAMP NOT NULL ENCODE lzo,
        mid         BIGINT NOT NULL ENCODE lzo,
        uri         VARCHAR(2048) ENCODE lzo,
        referer_uri VARCHAR(2048) ENCODE lzo,
        PRIMARY KEY (account_id, fact_time, mid)
    )
    DISTKEY (mid)
    SORTKEY (fact_date, account_id, fact_time, mid);
  • 23. Per Column Compression ● Used to fit more rows in each 1MB block ● Trade off between CPU and IO ● Allows Redshift to read rows from disk faster ● Has to use more CPU to decompress data ● Our Redshift queries are IO bound ○ We use compression extensively
  • 24. Constraints “Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon Redshift.” However, “If your application allows invalid foreign keys or primary keys, some queries could return incorrect results.” [emphasis added]
  • 25. Distribution Style
    Controls how Redshift distributes rows
    ● Styles
      ○ Even - rows are dealt out round robin (default)
      ○ Key - rows with the same key go to the same slice
        ■ Based on a single column from the table
      ○ All - a full copy of the table is stored on every node
        ■ Good for small tables
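The three styles can be sketched with a toy model in Python. This is only an illustration under stated assumptions: `NUM_SLICES`, the function names, and the use of `zlib.crc32` as the hash are all hypothetical (Redshift's internal hash function is not described in the slides), and the model does not distinguish nodes from slices.

```python
import zlib
from collections import defaultdict

NUM_SLICES = 4  # hypothetical cluster size, chosen for illustration


def distribute_even(rows, num_slices=NUM_SLICES):
    """EVEN: deal rows out round robin, ignoring their contents."""
    slices = defaultdict(list)
    for i, row in enumerate(rows):
        slices[i % num_slices].append(row)
    return slices


def distribute_key(rows, key, num_slices=NUM_SLICES):
    """KEY: rows sharing a key value always land in the same place."""
    slices = defaultdict(list)
    for row in rows:
        # crc32 is a stand-in hash, not the one Redshift uses.
        slice_id = zlib.crc32(str(row[key]).encode()) % num_slices
        slices[slice_id].append(row)
    return slices


def distribute_all(rows, num_slices=NUM_SLICES):
    """ALL: every location holds a full copy of the table."""
    return {s: list(rows) for s in range(num_slices)}


# Twenty made-up page views spread over five visitor ids (mid).
rows = [{"mid": i % 5, "uri": f"/page/{i}"} for i in range(20)]
even = distribute_even(rows)
by_key = distribute_key(rows, "mid")
everywhere = distribute_all(rows)
```

With key distribution, a join on `mid` between two tables distributed this way finds matching rows in the same place, which is why the DISTKEY choice matters for joins.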
  • 26. DISTKEY impacts Joins
    These two are very performant:
    ● DS_DIST_NONE: No redistribution is required, because corresponding slices are collocated on the compute nodes. You will typically have only one DS_DIST_NONE step, the join between the fact table and one dimension table.
    ● DS_DIST_ALL_NONE: No redistribution is required, because the inner join table used DISTSTYLE ALL. The entire table is located on every node.
    The others move data:
    ● DS_DIST_INNER: The inner table is redistributed.
    ● DS_BCAST_INNER: A copy of the entire inner table is broadcast to all the compute nodes.
    ● DS_DIST_ALL_INNER: The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL.
    ● DS_DIST_BOTH: Both tables are redistributed.
  • 27. Query Plan From Explain
    -> XN Hash Join DS_DIST_ALL_NONE (cost=112.50..14142.59 rows=170771 width=84)
         Hash Cond: ("outer".venueid = "inner".venueid)
         -> XN Hash Join DS_DIST_ALL_NONE (cost=109.98..10276.71 rows=172456 width=47)
              Hash Cond: ("outer".eventid = "inner".eventid)
              -> XN Merge Join DS_DIST_NONE (cost=0.00..6286.47 rows=172456 width=30)
                   Merge Cond: ("outer".listid = "inner".listid)
                   -> XN Seq Scan on listing (cost=0.00..1924.97 rows=192497 width=14)
                   -> XN Seq Scan on sales (cost=0.00..1724.56 rows=172456 width=24)
  • 28. Sort Key ● Data is stored on disk in sorted order ○ After being inserted into an empty table, or vacuumed ● Sort Key impacts vacuum performance ● Columnar data stored in 1MB blocks ○ min/max data stored as metadata ● Metadata used to improve query performance ○ Allows Redshift to skip unnecessary blocks
  • 29. Sort Key Take 1
    SORTKEY (account_id, fact_time, mid)
    ● As we added new facts, bad things started happening
    ● Resorting rows for vacuuming had to reorder almost all the rows :(
    ● This made vacuuming unreasonably slow, affecting how often we could vacuum and therefore query performance
    [Diagram: sorted table laid out as account 1 time ordered, account 2 time ordered, ..., account n time ordered; then the same table with a "new facts for all accounts" region added]
  • 30. Sort Key Take 2
    SORTKEY (fact_time, account_id, mid)
    ● Now our table is like an append only log, but had poor query performance
    ● For many of our queries, we only look at one account at a time
    ● Redshift blocks are 1MB each, each spanned many accounts
    ● When querying a single account, had to read from disk and ignore many rows from other accounts
    [Diagram: table laid out as 00:00 account ordered, 00:01 account ordered, ..., Now account ordered]
  • 31. Sort Key Take 3
    SORTKEY (fact_date, account_id, fact_time, mid)
    ● Append only log ✓
      ○ Cheap vacuuming ✓
    ● Single or few accounts per block ✓
      ○ Significantly improved query performance ✓
    [Diagram: table laid out as Jan 1st account ordered, Jan 2nd account ordered, ..., Today account ordered]
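The difference between Take 2 and Take 3 can be illustrated with a small Python simulation of block skipping. It is a rough sketch, not how Redshift is implemented: `ROWS_PER_BLOCK` stands in for a 1MB block, the per-block min/max of `account_id` stands in for the block metadata, and the data set (2 days, 500 minutes per day, 10 accounts) is made up.

```python
ROWS_PER_BLOCK = 100  # toy stand-in for a 1MB block


def build_blocks(rows, sort_key):
    """Sort rows, cut them into fixed-size blocks, and record min/max account_id per block."""
    ordered = sorted(rows, key=sort_key)
    blocks = []
    for start in range(0, len(ordered), ROWS_PER_BLOCK):
        chunk = ordered[start:start + ROWS_PER_BLOCK]
        accounts = [r["account_id"] for r in chunk]
        blocks.append({"min": min(accounts), "max": max(accounts), "rows": chunk})
    return blocks


def blocks_read(blocks, account_id):
    """Count blocks whose min/max range could contain the account; the rest are skipped."""
    return sum(1 for b in blocks if b["min"] <= account_id <= b["max"])


# Made-up facts: one row per account per minute.
rows = [
    {"fact_date": day, "fact_time": day * 1440 + minute, "account_id": account}
    for day in range(2)
    for minute in range(500)
    for account in range(10)
]

# Take 2: SORTKEY (fact_time, account_id)
take2 = build_blocks(rows, lambda r: (r["fact_time"], r["account_id"]))
# Take 3: SORTKEY (fact_date, account_id, fact_time)
take3 = build_blocks(rows, lambda r: (r["fact_date"], r["account_id"], r["fact_time"]))

take2_reads = blocks_read(take2, account_id=3)
take3_reads = blocks_read(take3, account_id=3)
```

In this toy data set a single-account query touches every block under the Take 2 ordering and only a tenth of them under Take 3, while new days still append at the end of the table.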
  • 32. Redshift ⇔ S3 Redshift & S3 have excellent integration ● Unload from Redshift to S3 via UNLOAD ○ Each slice unloads separately to S3 ○ We unload into a CSV format ● Load into Redshift from S3 via COPY ○ Applies all as inserts ○ Primary keys aren’t enforced by Redshift ■ Use staging table to detect duplicate keys
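The staging-table idea can be sketched as follows. This uses Python's built-in `sqlite3` module as a stand-in database so the example is runnable anywhere; it is not Redshift, and the table `fact_url_staging` and the sample rows are hypothetical. The tables are created without PRIMARY KEY clauses because SQLite would enforce them, whereas Redshift would not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- No PRIMARY KEY here: SQLite enforces them, Redshift treats them as informational.
    CREATE TABLE fact_url (account_id INT, fact_time TEXT, mid INT, uri TEXT);
    CREATE TABLE fact_url_staging (account_id INT, fact_time TEXT, mid INT, uri TEXT);

    INSERT INTO fact_url VALUES (1, '2014-10-02 18:30:00', 2, '/home');

    -- Pretend this is what a load into the staging table produced.
    INSERT INTO fact_url_staging VALUES
        (1, '2014-10-02 18:30:00', 2, '/home'),
        (1, '2014-10-02 18:31:00', 2, '/cart');
""")

# Keys that already exist in the target table are the duplicates.
duplicates = conn.execute("""
    SELECT s.account_id, s.fact_time, s.mid
    FROM fact_url_staging s
    JOIN fact_url f
      ON f.account_id = s.account_id
     AND f.fact_time = s.fact_time
     AND f.mid = s.mid
""").fetchall()

# Move over only the rows whose key is not already present.
conn.execute("""
    INSERT INTO fact_url
    SELECT s.account_id, s.fact_time, s.mid, s.uri
    FROM fact_url_staging s
    WHERE NOT EXISTS (
        SELECT 1 FROM fact_url f
        WHERE f.account_id = s.account_id
          AND f.fact_time = s.fact_time
          AND f.mid = s.mid
    )
""")
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM fact_url").fetchone()[0]
```

The same two queries (find keys present in both tables, then insert only the missing ones) are plain SQL, so the pattern should carry over, but the exact statements would need checking against Redshift itself.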
  • 33. Redshift UNLOAD
    unload ('select * from venue order by venueid')
    to 's3://mybucket/tickit/venue/reload_'
    credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
    manifest
    delimiter ',';
  • 34. Redshift UNLOAD Tip
    unload ('select * from venue order by venueid')
    ● The query in unload is itself a quoted string, which wreaks havoc with quotes around dates, e.g. fact_time <= '2014-10-02'
    ● Instead of escaping the quotes around the date times, dollar-quote the whole query
      ○ unload ($$ select * from venue order by venueid $$)
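A minimal Python sketch of the tip, assuming the statement text is assembled in application code before being sent to Redshift. The helper name `build_unload` and its parameters are hypothetical; it only builds a string and does not connect to anything.

```python
def build_unload(query, s3_prefix, credentials):
    """Wrap a query in $$ ... $$ so single quotes inside it need no escaping."""
    if "$$" in query:
        # A literal $$ inside the query would end the dollar-quoted string early.
        raise ValueError("query must not contain $$")
    return (
        f"unload ($$ {query} $$) "
        f"to '{s3_prefix}' "
        f"credentials '{credentials}' "
        "manifest delimiter ',';"
    )


statement = build_unload(
    "select * from fact_page_view where fact_time <= '2014-10-02'",
    "s3://mybucket/tickit/venue/reload_",
    "aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>",
)
```

The date literal keeps its ordinary single quotes inside the dollar-quoted query, so nothing has to be doubled or backslash-escaped.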
  • 35. Redshift COPY
    copy venue
    from 's3://mybucket/tickit/venue/reload_manifest'
    credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
    manifest
    delimiter ',';
  • 36. Try it Yourself! For Free!!!
    ● The Amazon Redshift documentation is well written
    ● It contains great tutorials with pricing estimates
    ● Amazon offers a 750 hour free trial of Redshift DW2.Large nodes