How Amazon.com is Leveraging Amazon Redshift (DAT306) | AWS re:Invent 2013

Learn how Amazon’s enterprise data warehouse, one of the world's largest data warehouses managing petabytes of data, is leveraging Amazon Redshift. Learn about Amazon's enterprise data warehouse best practices and solutions, and how they’re using Amazon Redshift technology to handle design and scale challenges.

Presentation Transcript

  • DAT306 - How Amazon.com, with One of the World’s Largest Data Warehouses, is Leveraging Amazon Redshift Erik Selberg (selberg@amazon.com) and Abhishek Agrawal (abhagrwa@amazon.com) November 14, 2013 © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  • Agenda • Amazon Data Warehouse Overview • Amazon Data Warehouse and Amazon Redshift Integration Project • Amazon Redshift Best Practices • Conclusion
  • Amazon Data Warehouse Overview Erik Selberg <selberg@amazon.com>
  • Amazon Data Warehouse • Authoritative repository of data for all of Amazon • Petabytes of data • Existing EDW is Oracle RAC; also using Amazon Elastic MapReduce and now Amazon Redshift • Owns and manages the hardware and software infrastructure – Apart from the Oracle DB, it is all Amazon IP • Not part of AWS
  • Introducing the Elephant… • Mission: Provide customers the best value – Leverage AWS only if it provides the best value – We aren’t moving 100% to Amazon Redshift • Publish best practices – If AWS isn’t the best, we’ll say so • There is a conflict of interest
  • Amazon Data Warehouse Architecture [diagram: a Control Plane (ETL Manager) orchestrating the existing EDW, Amazon EMR, and Amazon Redshift]
  • Amazon Data Warehouse – Growth Story • Petabytes of data • Growth of data volume – YoY storage requirements have grown 67% • Growth of processing volume – YoY processing demand has grown 47%
  • Long-Term Sustainable Scale [chart: cost ($$) vs. demand over time; SAN-based capacity runs above demand, with the gap labeled “Wasted”, while Redshift tracks demand]
  • Coping with Change [chart: when demand growth changes, fixed SAN capacity leaves demand unmet, while Redshift can follow the change]
  • Amazon Data Warehouse – Cost per Job • Our main efficiency metric – Cost per Job (CPJ): CPJ = ($CapEx + $DataCenter + $VendorSupport) / PeakJobsPerDay
  • What Drives Cost per Job… • Up? – Number of disks (data gets bigger!) – Number of servers – Short-sighted negotiations (4th-year support…) – Data center costs (power, rent) – Software (e.g., DBM) • Down? – Bidding (2+ vendors) – Moore’s Law (vendors fight this!) – Data design
  • Current State and Problems • Existing EDW – Multiple multi-petabyte clusters (redundancy and jobs) – Why not <x>? CPJ not lower • Data stored in SANs (not Exadata) • Performs poorly on scans of 10T+ • Long procurement cycles (3-month minimum)
  • Amazon Data Warehouse and Amazon Redshift Integration Project • Spent 2013 evaluating Amazon Redshift for Amazon data warehouse – Where does Amazon Redshift provide a better CPJ? – Can Amazon Redshift solve some pain (without introducing new pain)? • Picked 10K jobs and 275 tables to copy
  • Current State of Affairs • Biggest cluster size: 20+1 8XL • Peak daily jobs: 7211 (using all 4 clusters) • 4159 extracts • 3052 loads
  • Some Results • Benchmarking for 4159 jobs – Outperforming: 2719 – Underperforming: 1440 • Avg. runtime – 4:43 min in Amazon Redshift – 17:38 min in existing EDW • LOADs are slower • EXTRACTs are faster

    Job Type   RS Performance Category   Job Count
    EXTRACT    10X Faster                      945
    EXTRACT    5X Faster                       487
    EXTRACT    3X Faster                       393
    EXTRACT    2X Faster                       301
    EXTRACT    1X or same                      480
    EXTRACT    2X Slower                      1150
    LOAD       10X Faster                        7
    LOAD       5X Faster                        15
    LOAD       3X Faster                        23
    LOAD       2X Faster                        23
    LOAD       1X or same                       45
    LOAD       2X Slower                       290
  • Amazon Redshift Best Practices Abhishek Agrawal <abhagrwa@amazon.com>
  • Amazon Redshift Integration Best Practices • Integrating via Amazon S3 (manifests) • Primary key enforcement • Idempotent loads – MERGE via INSERT/UPDATE – Mimic trunc-load [backfills] • Trunc-partition using sort keys • Administration automation • Ensuring data correctness
  • Integrating via Amazon S3 • S3 in the US Standard Region is eventually consistent! • S3 LIST might not give the entire list of data right after you save it (this WILL eventually happen to you!) • Amazon Redshift loads everything it sees in a bucket – You may see all data files, Amazon Redshift may not, which can cause missing data
  • Best Practices – Using Amazon S3 • Read/COPY – System table validation – STL_LOAD_ERRORS – Verify the files loaded are the ‘intended’ files • Write/UNLOAD – System table validation – STL_UNLOAD_LOG – Verify all files that hold the data are on S3 • Manifests – Metadata specifying exactly what to read from S3 – Provides an authoritative reference to the data – Powerful in terms of user metadata format, encryption, etc.
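As a sketch of the manifest pattern described above (the bucket, table, and credential values are hypothetical placeholders), a COPY can be driven by an explicit manifest instead of a key prefix, then validated against the system tables:

```sql
-- Hypothetical manifest at s3://etl-bucket/orders/batch-001.manifest:
-- {"entries": [
--   {"url": "s3://etl-bucket/orders/part-0000.gz", "mandatory": true},
--   {"url": "s3://etl-bucket/orders/part-0001.gz", "mandatory": true}
-- ]}
-- "mandatory": true makes COPY fail if S3 has not yet surfaced a file,
-- rather than silently loading a partial batch.

COPY orders
FROM 's3://etl-bucket/orders/batch-001.manifest'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
GZIP DELIMITER '|'
MANIFEST;

-- Validation: surface any per-file load errors...
SELECT filename, line_number, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 20;

-- ...and confirm the set of files committed matches the manifest
SELECT TRIM(filename) AS loaded_file, lines_scanned
FROM stl_load_commits
WHERE query = pg_last_copy_id();
```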
  • Primary Key Enforcement • Amazon Redshift does not enforce primary keys – You will need to do this yourself to ensure data quality • Best practice – Introduce a temp table to check for duplicates in the incoming data – Validate against the incoming data to catch offenders – Put the data in the target table and validate the target data in the same transaction, before commit • Yes, this IS a lot of overhead
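A minimal sketch of that transaction-scoped check (orders and order_id are hypothetical names; the surrounding ETL code must abort the transaction if either duplicate query returns rows):

```sql
BEGIN;

-- Stage the incoming batch; Redshift itself will not reject duplicates
CREATE TEMP TABLE stage_orders (LIKE orders);
COPY stage_orders
FROM 's3://etl-bucket/orders/batch-001.manifest'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
MANIFEST;

-- Catch offenders within the incoming data itself
SELECT order_id, COUNT(*) AS dups
FROM stage_orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Load, then validate the target before the commit so a primary key
-- violation can still be rolled back
INSERT INTO orders SELECT * FROM stage_orders;

SELECT order_id
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;

COMMIT;  -- issued by the ETL layer only if both checks came back empty
```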
  • Idempotent Loads • Idempotent loads – doing a load 2+ times has the same effect as doing it once – Needed to manage load failures • MERGE – leverages the primary key, row at a time • TRUNC / INSERT – loads a partition at a time
  • MERGE • No native Amazon Redshift MERGE support • Merge is implemented as a multi-step process – Load the data into a temp table – Figure out inserts and load them – Figure out updates and modify the target table – Validate for duplicates
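A sketch of that multi-step process under the same hypothetical schema (a status column is assumed for illustration): update rows whose key already exists, insert the rest, and validate, all inside one transaction:

```sql
BEGIN;

CREATE TEMP TABLE stage_orders (LIKE orders);
COPY stage_orders
FROM 's3://etl-bucket/orders/batch-001.manifest'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
MANIFEST;

-- "Figure out updates": rewrite rows whose primary key already exists
UPDATE orders
SET status = s.status
FROM stage_orders s
WHERE orders.order_id = s.order_id;

-- "Figure out inserts": add rows whose primary key is new
INSERT INTO orders
SELECT s.*
FROM stage_orders s
LEFT JOIN orders o ON o.order_id = s.order_id
WHERE o.order_id IS NULL;

-- Validation for duplicates before commit
SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1;

COMMIT;
```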
  • TRUNC - INSERT • Solution – Distribute randomly – Use sort keys to align the data (mimics a partition) – Selectively delete and insert • Issues – Inserts land in an “unsorted” region – performance degrades without periodic VACUUM – Very slow (effectively row at a time)
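A sketch of the mimicked trunc-load for one such “partition”, here a month bounded on a hypothetical order_date sort key:

```sql
BEGIN;

-- Selectively delete the partition being reloaded; a predicate on the
-- sort key keeps the delete scan narrow
DELETE FROM orders
WHERE order_date >= '2013-11-01' AND order_date < '2013-12-01';

-- Reload it; these inserts land in the unsorted region of each slice
INSERT INTO orders
SELECT *
FROM stage_orders
WHERE order_date >= '2013-11-01' AND order_date < '2013-12-01';

COMMIT;

-- Periodic maintenance: VACUUM re-sorts the unsorted region and reclaims
-- the deleted rows; ANALYZE refreshes planner statistics
VACUUM orders;
ANALYZE orders;
```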
  • Other Temp Table Uses • Partial column data load • Filtered data load • Column transformations
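For example (hypothetical names again), a temp table lets a load filter rows and transform columns on the way into the target:

```sql
-- Stage the raw feed exactly as it arrives
CREATE TEMP TABLE stage_raw (
    order_id   BIGINT,
    amount_str VARCHAR(32),
    region     VARCHAR(8)
);
COPY stage_raw
FROM 's3://etl-bucket/raw/batch-001.manifest'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
MANIFEST;

-- Filtered load + column transformation: only US rows, with the amount
-- parsed from text into cents
INSERT INTO orders_us (order_id, amount_cents)
SELECT order_id,
       CAST(CAST(amount_str AS DECIMAL(12,2)) * 100 AS BIGINT)
FROM stage_raw
WHERE region = 'US';
```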
  • Automating Administration • Stored procs / Oracle workflow used to do admin tasks like retention, stats, etc. • Solution – We introduced a software layer to prepare the administrative task statements based on defined inputs – Execute using a JDBC connection – Can schedule work like stats collection, vacuum, etc.
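One way such a layer can prepare its statements (a sketch; the SVV_TABLE_INFO view and the 20%/10% thresholds here are assumptions, not the talk’s actual implementation) is to generate maintenance commands from the system catalog and execute each returned statement over the JDBC connection:

```sql
-- Emit one VACUUM per table with a large unsorted region, and one
-- ANALYZE per table with stale statistics; the software layer runs
-- each resulting statement via JDBC
SELECT 'VACUUM ' || "schema" || '.' || "table" || ';' AS admin_stmt
FROM svv_table_info
WHERE unsorted > 20        -- more than 20% of rows unsorted
UNION ALL
SELECT 'ANALYZE ' || "schema" || '.' || "table" || ';'
FROM svv_table_info
WHERE stats_off > 10;      -- statistics more than 10% off
```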
  • 2013 Results • CPJ is 55% less on Amazon Redshift in general – We can’t share the math, sorry – YMMV – Between Redshift and the Amazon data warehouse, known improvements get us to ~66% – Big wins are in big queries – Loads are slow and expensive • Moved ~10K jobs to ~60 8XLs (4 clusters) • We could move at most 45% of our work to Amazon Redshift with minimal changes
  • 2014 Plan • Focus on big tables (100T+) – Need to solve data expiry and backfill challenges • Solve problems with CPU-bound workloads • Interactive analytics (third-party vendor apps with Amazon Redshift + Oracle)
  • Please give us your feedback on this presentation (DAT306) • As a thank-you, we will select prize winners daily for completed surveys!