How Amazon.com is Leveraging Amazon Redshift (DAT306) | AWS re:Invent 2013

Learn how Amazon’s enterprise data warehouse, one of the world's largest data warehouses managing petabytes of data, is leveraging Amazon Redshift. Learn about Amazon's enterprise data warehouse best practices and solutions, and how they’re using Amazon Redshift technology to handle design and scale challenges.

Presentation Transcript

  • DAT306 - How Amazon.com, with One of the World’s Largest Data Warehouses, is Leveraging Amazon Redshift Erik Selberg (selberg@amazon.com) and Abhishek Agrawal (abhagrwa@amazon.com) November 14, 2013 © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  • Agenda • Amazon Data Warehouse Overview • Amazon Data Warehouse and Amazon Redshift Integration Project • Amazon Redshift Best Practices • Conclusion
  • Amazon Data Warehouse Overview Erik Selberg <selberg@amazon.com>
  • Amazon Data Warehouse • Authoritative repository of data for all of Amazon • Petabytes of data • Existing EDW is Oracle RAC; also using Amazon Elastic MapReduce and now Amazon Redshift • Owns and manages the hardware and software infrastructure – Apart from the Oracle DB, it is all Amazon IP • Not part of AWS
  • Introducing the Elephant… • Mission: Provide customers the best value – Leverage AWS only if it provides the best value – We aren’t moving 100% to Amazon Redshift • Publish best practices – If AWS isn’t the best, we’ll say so • There is a conflict of interest
  • Amazon Data Warehouse Architecture [diagram: a Control Plane (ETL Manager) orchestrating the existing EDW, Amazon EMR, and Amazon Redshift]
  • Amazon Data Warehouse – Growth Story • Petabytes of data • Growth of data volume – YoY storage requirements have grown 67% • Growth of processing volume – YoY processing demand has grown 47%
  • Long-Term Sustainable Scale [chart: cost ($$) vs. demand over time; SAN-based capacity runs above demand, with the gap labeled “Wasted”, while Redshift tracks demand]
  • Coping with Change [chart: when demand growth changes, fixed SAN capacity leaves demand unmet, while Redshift can follow the change]
  • Amazon Data Warehouse – Cost per Job • Our main efficiency metric – Cost per Job (CPJ): CPJ = ($CapEx + $DataCenter + $VendorSupport) / PeakJobsPerDay
  • What Drives Cost per Job… • Up? – Number of disks (data gets bigger!) – Number of servers – Short-sighted negotiations (4th-year support…) – Data center costs (power, rent) – Software (e.g., DBM) • Down? – Bidding (2+ vendors) – Moore’s Law (vendors fight this!) – Data design
  • Current State and Problems • Existing EDW – Multiple multi-petabyte clusters (redundancy and jobs) – Why not <x>? CPJ not lower • Data stored in SANs (not Exadata) • Performs poorly on scans of 10T+ • Long procurement cycles (3-month minimum)
  • Amazon Data Warehouse and Amazon Redshift Integration Project • Spent 2013 evaluating Amazon Redshift for Amazon data warehouse – Where does Amazon Redshift provide a better CPJ? – Can Amazon Redshift solve some pain (without introducing new pain)? • Picked 10K jobs and 275 tables to copy
  • Current State of Affairs • Biggest cluster size: 20+1 8XL • Peak daily jobs: 7211 (using all 4 clusters) • 4159 extracts • 3052 loads
  • Some Results • Benchmarking for 4159 jobs – Outperforming: 2719 – Underperforming: 1440 • Avg. runtime – 4:43 min in Amazon Redshift – 17:38 min in existing EDW • LOADs are slower • EXTRACTs are faster

    Job Type   RS Performance Category   Job Count
    EXTRACT    10X Faster                      945
    EXTRACT    5X Faster                       487
    EXTRACT    3X Faster                       393
    EXTRACT    2X Faster                       301
    EXTRACT    1X or same                      480
    EXTRACT    2X Slower                      1150
    LOAD       10X Faster                        7
    LOAD       5X Faster                        15
    LOAD       3X Faster                        23
    LOAD       2X Faster                        23
    LOAD       1X or same                       45
    LOAD       2X Slower                       290
  • Amazon Redshift Best Practices Abhishek Agrawal <abhagrwa@amazon.com>
  • Amazon Redshift Integration Best Practices • Integrating via Amazon S3 (manifests) • Primary key enforcement • Idempotent loads – MERGE via INSERT/UPDATE – Mimic trunc-load [backfills] • Trunc-partition using sort keys • Administration automation • Ensuring data correctness
  • Integrating via Amazon S3 • S3 in the US Standard Region is eventually consistent! • S3 LIST might not give the entire list of data right after you save it (this WILL eventually happen to you!) • Amazon Redshift loads everything it sees in a bucket – You may see all data files, Amazon Redshift may not, which can cause missing data
  • Best Practices – Using Amazon S3 • Read/COPY – System table validation – STL_LOAD_ERRORS – Verify the files loaded are the ‘intended’ files • Write/UNLOAD – System table validation – STL_UNLOAD_LOG – Verify all files that hold the data are on S3 • Manifests – Metadata specifying exactly what to read from S3 – Provides an authoritative reference to the data – Powerful in terms of user metadata format, encryption, etc.
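As a sketch of the manifest pattern described above (the bucket, table, and credential values are hypothetical placeholders), a COPY can be driven by an explicit manifest instead of a key prefix, then validated against the system tables:

```sql
-- Hypothetical manifest at s3://etl-bucket/orders/batch-001.manifest:
-- {"entries": [
--   {"url": "s3://etl-bucket/orders/part-0000.gz", "mandatory": true},
--   {"url": "s3://etl-bucket/orders/part-0001.gz", "mandatory": true}
-- ]}
-- "mandatory": true makes COPY fail if S3 has not yet surfaced a file,
-- rather than silently loading a partial batch.

COPY orders
FROM 's3://etl-bucket/orders/batch-001.manifest'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
GZIP DELIMITER '|'
MANIFEST;

-- Validation: surface any per-file load errors...
SELECT filename, line_number, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 20;

-- ...and confirm the set of files committed matches the manifest
SELECT TRIM(filename) AS loaded_file, lines_scanned
FROM stl_load_commits
WHERE query = pg_last_copy_id();
```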
  • Primary Key Enforcement • Amazon Redshift does not enforce primary keys – You will need to do this yourself to ensure data quality • Best practice – Introduce a temp table to check for duplicates in the incoming data – Validate against the incoming data to catch offenders – Put the data in the target table and validate the target data in the same transaction, before commit • Yes, this IS a lot of overhead
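A minimal sketch of that transaction-scoped check (orders and order_id are hypothetical names; the surrounding ETL code must abort the transaction if either duplicate query returns rows):

```sql
BEGIN;

-- Stage the incoming batch; Redshift itself will not reject duplicates
CREATE TEMP TABLE stage_orders (LIKE orders);
COPY stage_orders
FROM 's3://etl-bucket/orders/batch-001.manifest'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
MANIFEST;

-- Catch offenders within the incoming data itself
SELECT order_id, COUNT(*) AS dups
FROM stage_orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Load, then validate the target before the commit so a primary key
-- violation can still be rolled back
INSERT INTO orders SELECT * FROM stage_orders;

SELECT order_id
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;

COMMIT;  -- issued by the ETL layer only if both checks came back empty
```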
  • Idempotent Loads • Idempotent loads – doing a load 2+ times has the same effect as doing it once – Needed to manage load failures • MERGE – leverages the primary key, row at a time • TRUNC / INSERT – loads a partition at a time
  • MERGE • No native Amazon Redshift MERGE support • Merge is implemented as a multi-step process – Load the data into a temp table – Figure out inserts and load them – Figure out updates and modify the target table – Validate for duplicates
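A sketch of that multi-step process under the same hypothetical schema (a status column is assumed for illustration): update rows whose key already exists, insert the rest, and validate, all inside one transaction:

```sql
BEGIN;

CREATE TEMP TABLE stage_orders (LIKE orders);
COPY stage_orders
FROM 's3://etl-bucket/orders/batch-001.manifest'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
MANIFEST;

-- "Figure out updates": rewrite rows whose primary key already exists
UPDATE orders
SET status = s.status
FROM stage_orders s
WHERE orders.order_id = s.order_id;

-- "Figure out inserts": add rows whose primary key is new
INSERT INTO orders
SELECT s.*
FROM stage_orders s
LEFT JOIN orders o ON o.order_id = s.order_id
WHERE o.order_id IS NULL;

-- Validation for duplicates before commit
SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1;

COMMIT;
```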
  • TRUNC - INSERT • Solution – Distribute randomly – Use sort keys to align the data (mimics a partition) – Selectively delete and insert • Issues – Inserts land in an “unsorted” region – performance degrades without periodic VACUUM – Very slow (effectively row at a time)
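A sketch of the mimicked trunc-load for one such “partition”, here a month bounded on a hypothetical order_date sort key:

```sql
BEGIN;

-- Selectively delete the partition being reloaded; a predicate on the
-- sort key keeps the delete scan narrow
DELETE FROM orders
WHERE order_date >= '2013-11-01' AND order_date < '2013-12-01';

-- Reload it; these inserts land in the unsorted region of each slice
INSERT INTO orders
SELECT *
FROM stage_orders
WHERE order_date >= '2013-11-01' AND order_date < '2013-12-01';

COMMIT;

-- Periodic maintenance: VACUUM re-sorts the unsorted region and reclaims
-- the deleted rows; ANALYZE refreshes planner statistics
VACUUM orders;
ANALYZE orders;
```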
  • Other Temp Table Uses • Partial column data load • Filtered data load • Column transformations
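For example (hypothetical names again), a temp table lets a load filter rows and transform columns on the way into the target:

```sql
-- Stage the raw feed exactly as it arrives
CREATE TEMP TABLE stage_raw (
    order_id   BIGINT,
    amount_str VARCHAR(32),
    region     VARCHAR(8)
);
COPY stage_raw
FROM 's3://etl-bucket/raw/batch-001.manifest'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
MANIFEST;

-- Filtered load + column transformation: only US rows, with the amount
-- parsed from text into cents
INSERT INTO orders_us (order_id, amount_cents)
SELECT order_id,
       CAST(CAST(amount_str AS DECIMAL(12,2)) * 100 AS BIGINT)
FROM stage_raw
WHERE region = 'US';
```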
  • Automating Administration • Stored procs / Oracle workflow used to do admin tasks like retention, stats, etc. • Solution – We introduced a software layer to prepare the administrative task statements based on defined inputs – Execute using a JDBC connection – Can schedule work like stats collection, vacuum, etc.
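One way such a layer can prepare its statements (a sketch; the SVV_TABLE_INFO view and the 20%/10% thresholds here are assumptions, not the talk’s actual implementation) is to generate maintenance commands from the system catalog and execute each returned statement over the JDBC connection:

```sql
-- Emit one VACUUM per table with a large unsorted region, and one
-- ANALYZE per table with stale statistics; the software layer runs
-- each resulting statement via JDBC
SELECT 'VACUUM ' || "schema" || '.' || "table" || ';' AS admin_stmt
FROM svv_table_info
WHERE unsorted > 20        -- more than 20% of rows unsorted
UNION ALL
SELECT 'ANALYZE ' || "schema" || '.' || "table" || ';'
FROM svv_table_info
WHERE stats_off > 10;      -- statistics more than 10% off
```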
  • 2013 Results • CPJ is 55% less on Amazon Redshift in general – We can’t share the math, sorry – YMMV – Between Redshift and the Amazon data warehouse, known improvements get us to ~66% – Big wins are in big queries – Loads are slow and expensive • Moved ~10K jobs to ~60 8XLs (4 clusters) • We could move at most 45% of our work to Amazon Redshift with minimal changes
  • 2014 Plan • Focus on big tables (100T+) – Need to solve data expiry and backfill challenges • Solve problems with CPU-bound workloads • Interactive analytics (third-party vendor apps with Amazon Redshift + Oracle)
  • Please give us your feedback on this presentation (DAT306) • As a thank-you, we will select prize winners daily for completed surveys!