Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Getting Maximum Performance from Amazon Redshift: Complex Queries

32,250 views

Published on

Slides from Timon Karnezos' talk at AWS re:Invent 2013, session DAT305.

Published in: Technology, Business
  • Be the first to comment

Getting Maximum Performance from Amazon Redshift: Complex Queries

  1. 1. Getting Maximum Performance from Amazon Redshift: Complex Queries Timon Karnezos, Aggregate Knowledge November 13, 2013 © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  2. 2. Meet the new boss Multi-touch Attribution
  3. 3. Same as the old boss Behavioral Analytics
  4. 4. Same as the old boss Market Basket Analysis
  5. 5. $
  6. 6. We know how to do this, in SQL*! * SQL:2003
  7. 7. Here it is. SELECT record_date, user_id, action, site, revenue, SUM(1) OVER (PARTITION BY user_id ORDER BY record_date ASC) AS position FROM user_activities;
  8. 8. So why is MTA hard?
  9. 9. “Web Scale” Queries     30 queries 1700 lines of SQL 20+ logical phases GBs of output Data  ~109 daily impressions  ~107 daily conversions  ~104 daily sites  x 90 days per report.
  10. 10. So, how do we deliver complex reports over “web scale” data? (Pssst. The answer’s Redshift. Thanks AWS.)
  11. 11. Write (good) queries. Organize the data. Optimize for the humans.
  12. 12. Write (good) queries. Remember: SQL is code.
  13. 13. Software engineering rigor applies to SQL. Factored. Concise. Tested.
  14. 14. Common Table Expression
  15. 15. Factored. Concise. Tested.
  16. 16. Window functions
  17. 17. -- Position in timeline SUM(1) OVER (PARTITION BY user_id ORDER BY record_date DESC ROWS UNBOUNDED PRECEDING) -- Event count in timeline SUM(1) OVER (PARTITION BY user_id ORDER BY record_date DESC BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) -- Transition matrix of sites LAG(site_name) OVER (PARTITION BY user_id ORDER BY record_date DESC) -- Unique sites in timeline, up to now COUNT(DISTINCT site_name) OVER (PARTITION BY user_id ORDER BY record_date DESC ROWS UNBOUNDED PRECEDING)
  18. 18. Window functions Scalable, combinable. Compact but expressive. Simple to reason about.
  19. 19. Organize the data.
  20. 20. Leverage Redshift’s MPP roots. Fast, columnar scans, IO. Fast sort and load. Effective when work is distributable.
  21. 21. Leverage Redshift’s MPP roots. Sort into multiple representations. Materialize shared views. Hash-partition by user_id.
  22. 22. Optimize for the humans.
  23. 23. Operations should not be the bottleneck. Develop without fear. Trade time for money. Scale with impunity.
  24. 24. Operations should not be the bottleneck. Fast S3 = scratch space for cheap Linear query scaling = GTM quicker Dashboard Ops = dev/QA envs, marts, clusters with just a click
  25. 25. But, be frugal.
  26. 26. Quantify and control costs Test across different hardware, clusters. Shut down clusters often. Buy productivity, not bragging rights.
  27. 27. Thank you! References http://bit.ly/rs_ak http://www.adweek.com/news/technology/study-facebook-leads-24-sales-boost-146716 http://en.wikipedia.org/wiki/Behavioral_analytics http://en.wikipedia.org/wiki/Market_basket_analysis
  28. 28. Please give us your feedback on this presentation DAT305 As a thank you, we will select prize winners daily for completed surveys!

×