AWS: Architecting for resilience & cost at scale

As anyone using AWS will tell you, there are good parts and there are bad ones. If you come from a datacenter background, you are most definitely not in Kansas anymore, and we had our share of learning experiences as a result.
This is the story of the pitfalls we encountered and how, through architecture, convention and common sense, we built an infrastructure that is "Always Up" from the end user's perspective and economical to build, scale and operate.
The talk focuses on leveraging the strong, economical parts of AWS while avoiding the weak, expensive ones. I'll give a breakdown of the pain points, how we managed them, and how we avoided accidentally painting ourselves into a corner.
For many companies starting today, success is defined by large traffic or user numbers; if you are one of those companies, these lessons will very likely save you significant operational headaches.

Published in: Technology, Business

  1. ARCHITECTING IN AWS for resilience & cost at scale. Jos Boumans - @jiboumans. http://rafaykhan619.wix.com/downhouse Thursday 22 August 13
  2. RIPE NCC: Engineering manager for RIPE Database. http://www.ripe.net/db
  3. CANONICAL: Engineering manager for Ubuntu Server 10.04 & 10.10. http://lukeroberts.deviantart.com/art/Destroy-Ubuntu-93235775 http://www.ubuntu.com/business/server/overview
  4. KRUX: VP of Operations & Infrastructure. http://www.krux.com/
  5. SOME OF OUR CUSTOMERS
  6. LOTS OF TRAFFIC http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html
  7. AVERAGE REQUESTS* / SEC (bar chart, axis 0 to 15,000). *Twitter: new tweets; Wikipedia: articles read; Krux: new data points. http://mashable.com/2013/03/21/happy-7th-birthday-twitter/ http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm
  8. MONTHLY UNIQUE USERS (bar chart, axis 0 to 800,000,000). http://en.wikipedia.org/wiki/Wikipedia http://mashable.com/2013/03/21/happy-7th-birthday-twitter/
  9. WE CHOSE 'THE CLOUD' http://previewnetworks.com/blog/
  10. THERE ARE DOWNSIDES http://modernsavage.hubpages.com/hub/10-springfield-shopper-headlines
  11. RESILIENCE & COST AT SCALE
  12. FOCUS ON AWS http://aws.amazon.com/
  13. APRIL 21, 2011. Also: June 29, 2012 - October 22, 2012 - December 24, 2012. http://aws.amazon.com/message/680587/ http://aws.amazon.com/message/680342/ http://aws.amazon.com/message/67457/ http://aws.amazon.com/message/65648/ http://businessnerds.wordpress.com/2011/05/28/so-far-so-good…-the-review/
  14. ROOT CAUSE CATEGORIES (# of issues): Software, 8; Automation, 4; Process, 14. Software bugs & human error. http://www.slideshare.net/rahultyagi50999/amazon-cloud-major-outages-analysis
  15. JUNE 29, 2012 http://www.fanpop.com/spots/thunderstorm/images/25416163/title/thunderstorms-wallpaper http://aws.amazon.com/message/67457/
  16. AWS OUTAGE = YOUR OUTAGE http://it.mario.wikia.com/wiki/Lakitu
  17. RESILIENCE @ SCALE. Embrace Failure: Hardware will fail. Humans will make errors. Nature will produce thunderstorms. http://blabitcanada.com/category/twitter-2/
  18. DEFINE 'AVAILABLE': Things will break, so choose your degraded state. http://libcom.org/library/occupied-wall-street-some-tactical-thoughts-malcolm-harris
  19. BASIC API CALL: 3 potential points of failure
  20-24. FALLBACK PATTERNS (same text repeated across slides 20 to 24): The cost of resilience should be accuracy or latency. http://redis.io/ http://memcached.org/ http://varnish-cache.org/
  25. USER EXPERIENCE: My tweet got posted
  26. RESILIENCE TOOLS: Storage, Network & ACL. http://wordyou.ru/kolonki/my-teper-ne-na-avrore-a-na-titanike.html
  27. MANY SMALL NODES VERSUS A FEW LARGER NODES: The benefits of the many outweigh the benefits of the few. http://www.stealingfaith.com/2012/07/08/throw-off-the-tiny-ropes/
  28. DATABASES: CAP Theorem applies. Your choice: sacrifice availability or consistency. Orange is a lie. Diagram labels: RDBMS; BigTable based; Master / Slave based; CouchDB; Dynamo based. http://ferd.ca/beating-the-cap-theorem-checklist.html
  29. SIMPLE STORAGE SERVICE. S3: Arguably AWS' best feature. http://www.iwallpaper.us/gold-star-fo-christmas-wallpaper-140/ http://aws.amazon.com/s3/ https://forums.aws.amazon.com/message.jspa?messageID=182919#182919
  30. CACHE WHAT YOU CAN: HTTP responses, DB queries, user content. Browsers have caches too! http://cruncht.com/95/drupal-caching/ http://redis.io/ http://memcached.org/ http://varnish-cache.org/
  31. CLIENT SIDE STORAGE: Keep a copy of your users' data locally. http://www.w3.org/2001/tag/2010/09/ClientSideStorage.html http://www.wired.com/gadgetlab/2012/03/badass-gadget-ammo-lunch-box/
  32. USE ELASTIC LOAD BALANCERS: They will save you more than once. http://wallpapers5.com/wallpaper/Balance-Green-Tree-Frog/
  33. USE GLOBAL LOAD BALANCING: Fail over to the closest data center on region failure.
  34. SHOUT OUT: DYN. DNS for Bit.ly, Quora, Twitter, Wikia, Fastly, etc. http://dyn.com
  35. USE IAM ROLES FOR ACCESS: Humans make mistakes, including your humans.
  36. COST @ SCALE: Scaling without breaking the bank. http://mgx.com/blogs/wp-content/uploads/2013/07/piggybank.jpg
  37. EMR + SPOT INSTANCES. On demand rate: $0.165 / hour. http://aws.amazon.com/ec2/spot-instances/
  38. AMAZON REDSHIFT: Economical Business Intelligence. Scales with data size. http://www.flitemedia.com/music.php http://aws.amazon.com/redshift http://www.tableausoftware.com/
  39. AMAZON GLACIER: "Tapes for the Cloud Era". Writes vastly cheaper than reads. http://aws.amazon.com/glacier/ http://www.gorp.com/parks-guide/glacier-national-park-outdoor-pp2-guide-cid350021.html
  40. AWS SIMPLE EMAIL SERVICE: Dealing with email is boring and time consuming. http://aws.amazon.com/ses/ http://bfsdaniels.copycop.com/blog/all-about-printing/hypertargeting-with-direct-mail/
  41. AWS SIMPLE QUEUE SERVICE: Excellent for latency insensitive, small volume queues. http://www.toledoblade.com/Retail/2013/01/13/Disney-s-magic-bracelet-new-key-to-its-kingdom.html http://aws.amazon.com/sqs/ http://colby.id.au/benchmarking-sqs
  42. INSTANCE MARKETPLACE: Buy & sell reserved instances. http://commons.wikimedia.org/wiki/File:Javanese_market_place.jpg http://aws.amazon.com/ec2/reserved-instances/marketplace/
  43. AWS DYNAMO DB: Excellent for small keys & high read rates at known & consistent IOPS. http://hlbike.en.ecplaza.net/2.jpg http://aws.amazon.com/dynamodb/
  44. MAXIMIZE IOPS: RAID 0 ephemeral drives. Use m1.xlarge or c1.xlarge, or use SSDs if you need >20k IOPS. http://calculator.s3.amazonaws.com/calc5.html http://blog.scalyr.com/2012/10/16/a-systematic-look-at-ec2-io/ http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html#disk-performance
  45. RED FLAGS: Anti-patterns to watch out for. http://grandprix247.com/2012/09/03/spa-pile-up-renews-focus-on-formula-1-safety-matters/
  46. PROVISIONED IOPS EBS: Ephemeral storage on c1/m1.xlarge or SSD is better. If you must: m*large or c1.xlarge for dedicated NIC. http://www.slideshare.net/AmazonWebServices/ebs-mongo-dbwebinarfinal-nn http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html http://navidoo.ru/interest/Nasha_jizn/17676.html
  47. AWS DYNAMO DB: For high write rates or large/variable keys. http://aws.amazon.com/dynamodb/ http://www.walltowall.co.uk/program/standing-tall-worlds-tallest-people_93.aspx
  48. HIGH IO/DISK/RAM NODES: Use them deliberately. http://elledecoration.co.za/2010/07/gigantic/
  49. AWS CLOUDWATCH: Metric collection, Amazon style. Cost prohibitive & resolution too low. http://www.flickr.com/photos/65683080@N08/6893582132/ http://aws.amazon.com/cloudwatch/
  50. LOWER COST PER METRIC: Use graphite & statsd. http://graphite.wikidot.com/ https://github.com/etsy/statsd
  51. HOSTED ALTERNATIVES. Circonus: All the insights you ever wanted. StackDriver: Optimized for AWS. http://circonus.com http://stackdriver.com
  52. AWS CLOUDFORMATION: Templatize your entire stack. Harder to use as complexity increases. http://aws.amazon.com/cloudwatch/ http://fullnfenil7.blogspot.com/2012/05/amazing-cloud-shapes-photos.html#.UhKrZmRgZHg
  53. RDS FOR ANALYTICS/REPORTS: Paying OLTP prices for BI usage. Sharding will be a matter of time. http://nerds.airbnb.com/redshift-performance-cost http://business901.com/blog1/understanding-your-customer-problem/
  54. Q & A. @jiboumans http://slideshare.net/jiboumans http://vickicaruana.blogspot.com/2011/01/are-you-afraid-to-raise-your-hand.html
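
The FALLBACK PATTERNS slides state that the cost of resilience should be accuracy or latency. The slides do not show an implementation, so the following is only a minimal sketch of one way to read that idea: when the backend call fails, serve a stale cached answer (an accuracy cost) rather than an error. The class name `FallbackCache`, its parameters, and the in-process dictionary standing in for a cache such as Redis or Memcached are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class FallbackCache:
    """Serve fresh data when the backend works, stale data when it does not.

    Hypothetical helper: an in-memory dict stands in for an external cache.
    """

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # call into the real backend (may raise)
        self._ttl = ttl_seconds      # how long a cached value counts as fresh
        self._clock = clock          # injectable so the behaviour can be tested
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str, default: Optional[str] = None) -> Tuple[Optional[str], str]:
        """Return (value, source); source is 'cache', 'backend', 'stale' or 'default'."""
        now = self._clock()
        cached = self._store.get(key)

        # Fresh cache hit: no backend call needed.
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1], "cache"

        try:
            value = self._fetch(key)
        except Exception:
            # Backend failed: pay with accuracy instead of availability.
            if cached is not None:
                return cached[1], "stale"
            return default, "default"

        self._store[key] = (now, value)
        return value, "backend"
```

A caller can inspect the returned source to decide whether to flag the response as degraded. The stale entry is kept indefinitely here; a real deployment would likely bound how old a fallback answer may be.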