
How Amazon.com is Leveraging Amazon Redshift (DAT306) | AWS re:Invent 2013


Learn how Amazon's enterprise data warehouse, one of the world's largest with petabytes of data under management, is leveraging Amazon Redshift. The session covers Amazon's enterprise data warehouse best practices and solutions, and how the team uses Amazon Redshift to handle design and scale challenges.



  1. DAT306 - How Amazon.com, with One of the World's Largest Data Warehouses, is Leveraging Amazon Redshift. Erik Selberg <> and Abhishek Agrawal <>. November 14, 2013. © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  2. Agenda • Amazon Data Warehouse Overview • Amazon Data Warehouse and Amazon Redshift Integration Project • Amazon Redshift Best Practices • Conclusion
  3. Amazon Data Warehouse Overview Erik Selberg <>
  4. Amazon Data Warehouse • Authoritative repository of data for all of Amazon • Petabytes of data • Existing EDW is Oracle RAC; also using Amazon Elastic MapReduce and now Amazon Redshift • Owns and manages the hardware and software infrastructure – Apart from the Oracle DB, just Amazon IP • Not part of AWS
  5. Introducing the Elephant… • Mission: Provide customers the best value – Leverage AWS only if it provides the best value – We aren't moving 100% to Amazon Redshift • Publish best practices – If AWS isn't the best, we'll say so • There is a conflict of interest
  6. Amazon Data Warehouse Architecture [diagram: a Control Plane (ETL Manager) coordinating the Existing EDW, Amazon EMR, and Amazon Redshift]
  7. Amazon Data Warehouse – Growth Story • Petabytes of data • Growth of data volume – YoY storage requirements have grown 67% • Growth of processing volume – YoY processing demand has grown 47%
  8. Long-Term Sustainable Scale [chart: demand curve vs. stepwise SAN-based capacity, with over-provisioned capacity marked "$$ wasted"; Amazon Redshift capacity tracks demand]
  9. Coping with Change [chart: when growth changes, SAN capacity lags and demand goes unmet; Amazon Redshift adjusts with demand]
  10. Amazon Data Warehouse – Cost per Job • Our main efficiency metric – Cost per Job (CPJ): CPJ = ($CapEx + $DataCenter + $VendorSupport) / PeakJobsPerDay
  11. What Drives Cost per Job… Up? Down? • Up: number of disks (data gets bigger!), number of servers, short-sighted negotiations (4th year support…), data center costs (power, rent), software (e.g., DBM) • Down: bidding (2+ vendors), Moore's Law (vendors fight this!), data design
  12. Current State and Problems • Existing EDW – Multiple multi-petabyte clusters (redundancy and jobs) – Why not <x>? CPJ not lower • Data stored in SANs (not Exadata) • Performs poorly on scans of 10T+ • Long procurement cycles (3 months minimum)
  13. Amazon Data Warehouse and Amazon Redshift Integration Project • Spent 2013 evaluating Amazon Redshift for the Amazon data warehouse – Where does Amazon Redshift provide a better CPJ? – Can Amazon Redshift solve some pain (without introducing new pain)? • Picked 10K jobs and 275 tables to copy
  14. Current State of Affairs • Biggest cluster size: 20+1 8XL nodes • Peak daily jobs: 7211 (using all 4 clusters) • 4159 extracts • 3052 loads
  15. Some Results • Benchmarking for 4159 jobs – Outperforming: 2719 – Underperforming: 1440 – Avg. runtime: 4:43 min on Amazon Redshift vs. 17:38 min on the existing EDW • LOADs are slower • EXTRACTs are faster
      Job Type | RS Performance Category | Job Count
      EXTRACT  | 10X Faster | 945
      EXTRACT  | 5X Faster  | 487
      EXTRACT  | 3X Faster  | 393
      EXTRACT  | 2X Faster  | 301
      EXTRACT  | 1X or same | 480
      EXTRACT  | 2X Slower  | 1150
      LOAD     | 10X Faster | 7
      LOAD     | 5X Faster  | 15
      LOAD     | 3X Faster  | 23
      LOAD     | 2X Faster  | 23
      LOAD     | 1X or same | 45
      LOAD     | 2X Slower  | 290
  16. Amazon Redshift Best Practices Abhishek Agrawal <>
  17. Amazon Redshift Integration Best Practices • Integrating via Amazon S3 (manifests) • Primary key enforcement • Idempotent loads – MERGE via INSERT/UPDATE – Mimic Trunc-Load [backfills] • Trunc-partition using sort keys • Administration automation • Ensuring data correctness
  18. Integrating via Amazon S3 • S3 in the US Standard Region is eventually consistent! • An S3 LIST might not return the entire list of data right after you save it (this WILL eventually happen to you!) • Amazon Redshift loads everything it sees in a bucket – You may see all the data files while Amazon Redshift does not, which can cause missing data
  19. Best Practices – Using Amazon S3 • Read/COPY – System table validation (STL_LOAD_ERRORS) – Verify the files loaded are the 'intended' files • Write/UNLOAD – System table validation (STL_UNLOAD_LOG) – Verify all files that hold the data are on S3 • Manifests – Metadata to know exactly what to read from S3 – Provide an authoritative reference to the data – Powerful in terms of user metadata format, encryption, etc.
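A manifest-driven load along these lines avoids the eventual-consistency trap: COPY reads only the files the manifest names, never whatever a bucket LIST happens to return. This is a sketch; the bucket, table, and file names are hypothetical, and the CREDENTIALS string is a placeholder.

```sql
-- Hypothetical manifest stored at s3://mybucket/manifests/orders.manifest:
-- {"entries": [
--   {"url": "s3://mybucket/orders/part-0000.gz", "mandatory": true},
--   {"url": "s3://mybucket/orders/part-0001.gz", "mandatory": true}
-- ]}

-- COPY loads exactly the files listed in the manifest.
COPY orders
FROM 's3://mybucket/manifests/orders.manifest'
CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
MANIFEST
GZIP;

-- Validate the load against the system tables afterward.
SELECT query, filename, line_number, err_reason
FROM stl_load_errors
ORDER BY query DESC
LIMIT 10;
```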
  20. Primary Key Enforcement • Amazon Redshift does not enforce primary keys – You will need to do this yourself to ensure data quality • Best practice – Introduce a temp table to check for duplicates in the incoming data – Validate against the incoming data to catch offenders – Put the data in the target table and validate the target data in the same transaction before commit • Yes, this IS a lot of overhead
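The temp-table check described above might look like the following sketch, with hypothetical table and column names (`orders`, `stage_orders`, key `order_id`):

```sql
BEGIN;

-- Stage the incoming batch in a temp table shaped like the target.
CREATE TEMP TABLE stage_orders (LIKE orders);
COPY stage_orders
FROM 's3://mybucket/manifests/orders.manifest'
CREDENTIALS '...' MANIFEST;

-- Check 1: duplicates within the incoming batch itself.
SELECT order_id, COUNT(*)
FROM stage_orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Check 2: key collisions with rows already in the target.
SELECT s.order_id
FROM stage_orders s
JOIN orders o ON s.order_id = o.order_id;

-- If both checks return zero rows, move the data and commit;
-- otherwise ROLLBACK and investigate the offending rows.
INSERT INTO orders SELECT * FROM stage_orders;
COMMIT;
```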
  21. Idempotent Loads • Idempotent loads – doing a load 2+ times is the same as doing 1 load – Needed to manage load failures • MERGE – leverages the primary key, a row at a time • TRUNC / INSERT – loads a partition at a time
  22. MERGE • No native Amazon Redshift MERGE support • Merge is implemented as a multi-step process – Load the data into a temp table – Figure out the inserts and load them – Figure out the updates and modify the target table – Validate for duplicates
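One way to implement the update/insert steps, continuing the hypothetical `orders` / `stage_orders` naming (columns `status` and `updated_at` are assumptions for illustration):

```sql
BEGIN;

-- Updates: modify target rows whose key already exists.
UPDATE orders
SET    status = s.status,
       updated_at = s.updated_at
FROM   stage_orders s
WHERE  orders.order_id = s.order_id;

-- Inserts: add rows whose key is not yet in the target.
INSERT INTO orders
SELECT s.*
FROM   stage_orders s
LEFT JOIN orders o ON s.order_id = o.order_id
WHERE  o.order_id IS NULL;

COMMIT;
```

Running the whole sequence in one transaction is what makes a retry after a failure safe: either the full merge lands or none of it does.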
  23. TRUNC - INSERT • Solution – Distribute randomly – Use sort keys to align data (mimics a partition) – Selectively delete and insert • Issues – Inserts land in an "unsorted" bucket – performance degrades without periodic VACUUM – Very slow (effectively a row at a time)
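A sketch of reloading one "partition" this way, assuming a hypothetical sort key column `snapshot_day` on the `orders` table:

```sql
BEGIN;

-- Selectively delete the partition being reloaded...
DELETE FROM orders WHERE snapshot_day = '2013-11-14';

-- ...then insert the fresh copy of it from the staged batch.
INSERT INTO orders
SELECT * FROM stage_orders
WHERE snapshot_day = '2013-11-14';

COMMIT;

-- New rows land in the unsorted region, so reclaim space and
-- re-sort periodically or scan performance degrades.
VACUUM orders;
```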
  24. Other Temp Table Uses • Partial column data load • Filtered data load • Column transformations
  25. Automating Administration • Stored procs / Oracle workflows were used to do admin tasks like retention, stats, etc. • Solution – We introduced a software layer that prepares the administrative task statements from defined inputs – Executes them over a JDBC connection – Can schedule work like stats collection, vacuum, etc.
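The kinds of statements such a layer might generate and run over JDBC look like the following (table name and retention date are hypothetical):

```sql
-- Retention: drop data past the retention window.
DELETE FROM orders WHERE snapshot_day < '2013-01-01';

-- Statistics: keep the query planner's estimates current.
ANALYZE orders;

-- Reclaim space and re-sort after deletes and unsorted inserts.
VACUUM orders;
```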
  26. 2013 Results • CPJ is 55% less on Amazon Redshift in general – We can't share the math, sorry; YMMV – Between Amazon Redshift and the Amazon data warehouse, known improvements get us to ~66% – Big wins are in big queries – Loads are slow and expensive • Moved ~10K jobs to ~60 8XLs (4 clusters) • We could move at most 45% of our work to Amazon Redshift with minimal changes
  27. 2014 Plan • Focus on big tables (100T+) – Need to solve data expiry and backfill challenges • Solve problems with CPU-bound workloads • Interactive analytics (third-party vendor apps with Amazon Redshift + Oracle)
  28. Please give us your feedback on this presentation (DAT306). As a thank you, we will select prize winners daily for completed surveys!