
Architecting for Failure in AWS - PuppetConf 2013

"Architecting for Failure in AWS" by Jos Boumans, VP of Operations, Krux Digital.

Presentation Overview: Krux is an infrastructure provider for many of the websites you use online today, like NYTimes.com, WSJ.com, Wikia and NBCU. For every request on those properties, Krux will get one or more as well. We grew from zero traffic to several billion requests per day in the span of two years, and we did so exclusively in AWS. As anyone using AWS will be able to tell you, there are good parts and there are bad ones. This is the story of all the pitfalls we encountered, and how, through architecture, convention and common sense, we managed to build an infrastructure that is "Always Up" from the end user's perspective and incredibly economical to build, scale & operate.

Speaker Bio: Jos is the VP of Operations at Krux, supporting a platform with over 4 billion requests per day with a tiny Ops team. Every bit of the AWS stack is automated, monitored & graphed, with maximized resilience and minimized cost. In a previous life he ran the Ubuntu Server group at Canonical and the Database group at RIPE, which is responsible for all the authoritative IP address data in Europe, the Middle East & Asia. Jos is a regular speaker at conferences like OSCON, Devoxx, PuppetConf, etc., where he mostly speaks on dealing with AWS Operations from all angles.

Published in: Technology, Business

Transcript of "Architecting for Failure in AWS - PuppetConf 2013"

  1. ARCHITECTING IN AWS: for resilience & cost at scale. Jos Boumans - @jiboumans http://rafaykhan619.wix.com/downhouse Thursday 22 August 13
  2. RIPE NCC: Engineering manager for RIPE Database http://www.ripe.net/db
  3. CANONICAL: Engineering manager for Ubuntu Server 10.04 & 10.10 http://lukeroberts.deviantart.com/art/Destroy-Ubuntu-93235775 http://www.ubuntu.com/business/server/overview
  4. KRUX: VP of Operations & Infrastructure http://www.krux.com/
  5. SOME OF OUR CUSTOMERS
  6. LOTS OF TRAFFIC http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html
  7. AVERAGE REQUESTS* / SEC (chart). *Twitter: new tweets; Wikipedia: articles read; Krux: new data points. http://mashable.com/2013/03/21/happy-7th-birthday-twitter/ http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm
  8. MONTHLY UNIQUE USERS (chart) http://en.wikipedia.org/wiki/Wikipedia http://mashable.com/2013/03/21/happy-7th-birthday-twitter/
  9. WE CHOSE 'THE CLOUD' http://previewnetworks.com/blog/
  10. THERE ARE DOWNSIDES http://modernsavage.hubpages.com/hub/10-springfield-shopper-headlines
  11. RESILIENCE & COST AT SCALE
  12. FOCUS ON AWS http://aws.amazon.com/
  13. APRIL 21, 2011. Also: June 29, 2012; October 22, 2012; December 24, 2012. http://aws.amazon.com/message/680587/ http://aws.amazon.com/message/680342/ http://aws.amazon.com/message/67457/ http://aws.amazon.com/message/65648/ http://businessnerds.wordpress.com/2011/05/28/so-far-so-good…-the-review/
  14. ROOT CAUSE CATEGORIES: Software bugs & human error. # of issues: Software 8, Automation 4, Process 14. http://www.slideshare.net/rahultyagi50999/amazon-cloud-major-outages-analysis
  15. JUNE 29, 2012 http://www.fanpop.com/spots/thunderstorm/images/25416163/title/thunderstorms-wallpaper http://aws.amazon.com/message/67457/
  16. AWS OUTAGE = YOUR OUTAGE http://it.mario.wikia.com/wiki/Lakitu
  17. RESILIENCE @ SCALE: Embrace failure. Hardware will fail. Humans will make errors. Nature will produce thunderstorms. http://blabitcanada.com/category/twitter-2/
  18. DEFINE 'AVAILABLE': Things will break, so choose your degraded state. http://libcom.org/library/occupied-wall-street-some-tactical-thoughts-malcolm-harris
  19. BASIC API CALL: 3 potential points of failure
  20-24. FALLBACK PATTERNS: The cost of resilience should be accuracy or latency http://redis.io/ http://memcached.org/ http://varnish-cache.org/
  25. USER EXPERIENCE: My tweet got posted
  26. RESILIENCE TOOLS: Storage, Network & ACL http://wordyou.ru/kolonki/my-teper-ne-na-avrore-a-na-titanike.html
  27. MANY SMALL NODES VERSUS A FEW LARGER NODES: The benefits of the many outweigh the benefits of the few http://www.stealingfaith.com/2012/07/08/throw-off-the-tiny-ropes/
  28. DATABASES: CAP Theorem applies. Your choice: sacrifice availability or consistency. Orange is a lie. Diagram labels: RDBMS, BigTable based, Master / Slave based, CouchDB, Dynamo based. http://ferd.ca/beating-the-cap-theorem-checklist.html
  29. SIMPLE STORAGE SERVICE: S3: Arguably AWS' best feature http://www.iwallpaper.us/gold-star-fo-christmas-wallpaper-140/ http://aws.amazon.com/s3/ https://forums.aws.amazon.com/message.jspa?messageID=182919#182919
  30. CACHE WHAT YOU CAN: HTTP responses, DB queries, user content. Browsers have caches too! http://cruncht.com/95/drupal-caching/ http://redis.io/ http://memcached.org/ http://varnish-cache.org/
  31. CLIENT SIDE STORAGE: Keep a copy of your users' data locally http://www.w3.org/2001/tag/2010/09/ClientSideStorage.html http://www.wired.com/gadgetlab/2012/03/badass-gadget-ammo-lunch-box/
  32. USE ELASTIC LOAD BALANCERS: They will save you more than once http://wallpapers5.com/wallpaper/Balance-Green-Tree-Frog/
  33. USE GLOBAL LOAD BALANCING: Fail over to the closest data center on region failure
  34. SHOUT OUT: DYN. DNS for Bit.ly, Quora, Twitter, Wikia, Fastly, etc. http://dyn.com
  35. USE IAM ROLES FOR ACCESS: Humans make mistakes, including your humans
  36. COST @ SCALE: Scaling without breaking the bank http://mgx.com/blogs/wp-content/uploads/2013/07/piggybank.jpg
  37. EMR + SPOT INSTANCES: On demand rate: $0.165 / hour http://aws.amazon.com/ec2/spot-instances/
  38. AMAZON REDSHIFT: Economical Business Intelligence. Scales with data size. http://www.flitemedia.com/music.php http://aws.amazon.com/redshift http://www.tableausoftware.com/
  39. AMAZON GLACIER: "Tapes for the Cloud Era". Writes vastly cheaper than reads. http://aws.amazon.com/glacier/ http://www.gorp.com/parks-guide/glacier-national-park-outdoor-pp2-guide-cid350021.html
  40. AWS SIMPLE EMAIL SERVICE: Dealing with email is boring and time consuming http://aws.amazon.com/ses/ http://bfsdaniels.copycop.com/blog/all-about-printing/hypertargeting-with-direct-mail/
  41. AWS SIMPLE QUEUE SERVICE: Excellent for latency-insensitive, small-volume queues http://www.toledoblade.com/Retail/2013/01/13/Disney-s-magic-bracelet-new-key-to-its-kingdom.html http://aws.amazon.com/sqs/ http://colby.id.au/benchmarking-sqs
  42. INSTANCE MARKETPLACE: Buy & sell reserved instances http://commons.wikimedia.org/wiki/File:Javanese_market_place.jpg http://aws.amazon.com/ec2/reserved-instances/marketplace/
  43. AWS DYNAMO DB: Excellent for small keys & high read rates at known & consistent IOPS http://hlbike.en.ecplaza.net/2.jpg http://aws.amazon.com/dynamodb/
  44. MAXIMIZE IOPS: RAID 0 ephemeral drives. Use m1.xlarge or c1.xlarge, or use SSDs if you need >20k IOPS. http://calculator.s3.amazonaws.com/calc5.html http://blog.scalyr.com/2012/10/16/a-systematic-look-at-ec2-io/ http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html#disk-performance
  45. RED FLAGS: Anti-patterns to watch out for http://grandprix247.com/2012/09/03/spa-pile-up-renews-focus-on-formula-1-safety-matters/
  46. PROVISIONED IOPS EBS: Ephemeral storage on c1/m1.xlarge or SSD is better. If you must: m*large or c1.xlarge for dedicated NIC. http://www.slideshare.net/AmazonWebServices/ebs-mongo-dbwebinarfinal-nn http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html http://navidoo.ru/interest/Nasha_jizn/17676.html
  47. AWS DYNAMO DB: For high write rates or large/variable keys http://aws.amazon.com/dynamodb/ http://www.walltowall.co.uk/program/standing-tall-worlds-tallest-people_93.aspx
  48. HIGH IO/DISK/RAM NODES: Use them deliberately http://elledecoration.co.za/2010/07/gigantic/
  49. AWS CLOUDWATCH: Metric collection, Amazon style. Cost prohibitive & resolution too low. http://www.flickr.com/photos/65683080@N08/6893582132/ http://aws.amazon.com/cloudwatch/
  50. LOWER COST PER METRIC: Use graphite & statsd http://graphite.wikidot.com/ https://github.com/etsy/statsd
  51. HOSTED ALTERNATIVES: Circonus: All the insights you ever wanted. StackDriver: Optimized for AWS. http://circonus.com http://stackdriver.com
  52. AWS CLOUDFORMATION: Templatize your entire stack. Harder to use as complexity increases. http://aws.amazon.com/cloudwatch/ http://fullnfenil7.blogspot.com/2012/05/amazing-cloud-shapes-photos.html#.UhKrZmRgZHg
  53. RDS FOR ANALYTICS/REPORTS: Paying OLTP prices for BI usage. Sharding will be a matter of time. http://nerds.airbnb.com/redshift-performance-cost http://business901.com/blog1/understanding-your-customer-problem/
  54. Q & A: @jiboumans http://slideshare.net/jiboumans http://vickicaruana.blogspot.com/2011/01/are-you-afraid-to-raise-your-hand.html
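
The fallback pattern named on slides 20-24 ("the cost of resilience should be accuracy or latency") could be sketched roughly as below. This is a minimal illustration and not the implementation shown in the talk: the class name `FallbackReader`, the `fetch` callable and the in-process dictionary are all hypothetical stand-ins, where a real deployment would more likely use one of the caches the slides link to (Redis, Memcached, Varnish).

```python
from typing import Callable, Dict, Tuple


class FallbackReader:
    """Serve fresh data when possible, stale or default data when not.

    Hypothetical helper for illustration only; the names are assumptions.
    """

    def __init__(self, fetch: Callable[[str], str], default: str) -> None:
        self._fetch = fetch              # primary lookup; may raise on failure
        self._default = default          # last-resort answer (least accurate)
        self._cache: Dict[str, str] = {}  # last known good value per key

    def get(self, key: str) -> Tuple[str, str]:
        """Return (value, source); source is 'primary', 'cache' or 'default'."""
        try:
            value = self._fetch(key)
        except Exception:
            # Primary is down: trade accuracy for availability.
            if key in self._cache:
                return self._cache[key], "cache"
            return self._default, "default"
        # Primary answered: remember the value for future degraded reads.
        self._cache[key] = value
        return value, "primary"
```

With this shape, a caller still gets an answer during a backend outage; it is simply less accurate (stale or default), which is the degraded state the caller chose in advance.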