Architecting for Failure in AWS - PuppetConf 2013

"Architecting for Failure in AWS" by Jos Boumans, VP of Operations, Krux Digital.

Presentation Overview: Krux is an infrastructure provider for many of the websites you use online today, like NYTimes.com, WSJ.com, Wikia and NBCU. For every request on those properties, Krux receives one or more requests as well. We grew from zero traffic to several billion requests per day in the span of two years, and we did so exclusively in AWS. As anyone using AWS will be able to tell you, there are good parts and there are bad ones. This is the story of all the pitfalls we encountered, and how, through architecture, convention and common sense, we managed to build an infrastructure that is "Always Up" from the end user's perspective and incredibly economical to build, scale & operate.

Speaker Bio: Jos is the VP of Operations at Krux, supporting a platform with over 4 billion requests per day with a tiny Ops team. Every bit of the AWS stack is automated, monitored & graphed, with maximized resilience and minimized cost. In a previous life he ran the Ubuntu Server group at Canonical and the Database group at RIPE, which is responsible for all the authoritative IP address data in Europe, the Middle East & Asia. Jos is a regular speaker at conferences like OSCON, Devoxx, PuppetConf, etc., where he mostly speaks on dealing with AWS Operations from all angles.

    Presentation Transcript

    • ARCHITECTING IN AWS for resilience & cost at scale Jos Boumans - @jiboumans http://rafaykhan619.wix.com/downhouse Thursday 22 August 13
    • RIPE NCC Engineering manager for RIPE Database http://www.ripe.net/db
    • CANONICAL Engineering manager for Ubuntu Server 10.04 & 10.10 http://www.ubuntu.com/business/server/overview http://lukeroberts.deviantart.com/art/Destroy-Ubuntu-93235775
    • KRUX VP of Operations & Infrastructure http://www.krux.com/
    • SOME OF OUR CUSTOMERS
    • LOTS OF TRAFFIC http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html
    • AVERAGE REQUESTS* / SEC [Bar chart, axis 0 to 15,000] *Twitter: new tweets; Wikipedia: articles read; Krux: new data points http://mashable.com/2013/03/21/happy-7th-birthday-twitter/ http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm
    • MONTHLY UNIQUE USERS [Bar chart, axis 0 to 800,000,000] http://en.wikipedia.org/wiki/Wikipedia http://mashable.com/2013/03/21/happy-7th-birthday-twitter/
    • WE CHOSE 'THE CLOUD' http://previewnetworks.com/blog/
    • THERE ARE DOWNSIDES http://modernsavage.hubpages.com/hub/10-springfield-shopper-headlines
    • RESILIENCE & COST AT SCALE
    • FOCUS ON AWS http://aws.amazon.com/
    • APRIL 21, 2011 Also: June 29, 2012; October 22, 2012; December 24, 2012 http://aws.amazon.com/message/680587/ http://aws.amazon.com/message/680342/ http://aws.amazon.com/message/67457/ http://aws.amazon.com/message/65648/ http://businessnerds.wordpress.com/2011/05/28/so-far-so-good…-the-review/
    • ROOT CAUSE CATEGORIES [Chart, # of issues: Software 8, Automation 4, Process 14] Software bugs & human error http://www.slideshare.net/rahultyagi50999/amazon-cloud-major-outages-analysis
    • JUNE 29, 2012 http://aws.amazon.com/message/67457/ http://www.fanpop.com/spots/thunderstorm/images/25416163/title/thunderstorms-wallpaper
    • AWS OUTAGE = YOUR OUTAGE http://it.mario.wikia.com/wiki/Lakitu
    • RESILIENCE @ SCALE Embrace Failure: Hardware will fail. Humans will make errors. Nature will produce thunderstorms. http://blabitcanada.com/category/twitter-2/
    • DEFINE 'AVAILABLE' Things will break, so choose your degraded state. http://libcom.org/library/occupied-wall-street-some-tactical-thoughts-malcolm-harris
    • BASIC API CALL 3 potential points of failure
    • FALLBACK PATTERNS The cost of resilience should be accuracy or latency http://redis.io/ http://memcached.org/ http://varnish-cache.org/
    • USER EXPERIENCE My tweet got posted
    • RESILIENCE TOOLS Storage, Network & ACL http://wordyou.ru/kolonki/my-teper-ne-na-avrore-a-na-titanike.html
    • MANY SMALL NODES VERSUS A FEW LARGER NODES The benefits of the many outweigh the benefits of the few http://www.stealingfaith.com/2012/07/08/throw-off-the-tiny-ropes/
    • DATABASES CAP Theorem applies. Your choice: sacrifice availability or consistency. Orange is a lie. [Diagram labels: RDBMS; BigTable based; Master / Slave based; CouchDB; Dynamo based] http://ferd.ca/beating-the-cap-theorem-checklist.html
    • SIMPLE STORAGE SERVICE S3: Arguably AWS' best feature http://aws.amazon.com/s3/ https://forums.aws.amazon.com/message.jspa?messageID=182919#182919 http://www.iwallpaper.us/gold-star-fo-christmas-wallpaper-140/
    • CACHE WHAT YOU CAN HTTP responses, DB queries, user content. Browsers have caches too! http://cruncht.com/95/drupal-caching/ http://redis.io/ http://memcached.org/ http://varnish-cache.org/
    • CLIENT SIDE STORAGE Keep a copy of your users' data locally http://www.w3.org/2001/tag/2010/09/ClientSideStorage.html http://www.wired.com/gadgetlab/2012/03/badass-gadget-ammo-lunch-box/
    • USE ELASTIC LOAD BALANCERS They will save you more than once http://wallpapers5.com/wallpaper/Balance-Green-Tree-Frog/
    • USE GLOBAL LOAD BALANCING Fail over to the closest data center on region failure
    • SHOUT OUT: DYN DNS for Bit.ly, Quora, Twitter, Wikia, Fastly, etc. http://dyn.com
    • USE IAM ROLES FOR ACCESS Humans make mistakes, including your humans
    • COST @ SCALE Scaling without breaking the bank http://mgx.com/blogs/wp-content/uploads/2013/07/piggybank.jpg
    • EMR + SPOT INSTANCES On demand rate: $0.165 / hour http://aws.amazon.com/ec2/spot-instances/
    • AMAZON REDSHIFT Economical Business Intelligence. Scales with data size http://aws.amazon.com/redshift http://www.tableausoftware.com/ http://www.flitemedia.com/music.php
    • AMAZON GLACIER "Tapes for the Cloud Era" Writes vastly cheaper than reads http://aws.amazon.com/glacier/ http://www.gorp.com/parks-guide/glacier-national-park-outdoor-pp2-guide-cid350021.html
    • AWS SIMPLE EMAIL SERVICE Dealing with email is boring and time consuming http://aws.amazon.com/ses/ http://bfsdaniels.copycop.com/blog/all-about-printing/hypertargeting-with-direct-mail/
    • AWS SIMPLE QUEUE SERVICE Excellent for latency-insensitive, small-volume queues http://aws.amazon.com/sqs/ http://colby.id.au/benchmarking-sqs http://www.toledoblade.com/Retail/2013/01/13/Disney-s-magic-bracelet-new-key-to-its-kingdom.html
    • INSTANCE MARKETPLACE Buy & sell reserved instances http://aws.amazon.com/ec2/reserved-instances/marketplace/ http://commons.wikimedia.org/wiki/File:Javanese_market_place.jpg
    • AWS DYNAMO DB Excellent for small keys & high read rates at known & consistent IOPS http://aws.amazon.com/dynamodb/ http://hlbike.en.ecplaza.net/2.jpg
    • MAXIMIZE IOPS RAID 0 the ephemeral drives; use m1.xlarge or c1.xlarge, or use SSDs if you need >20k IOPS http://calculator.s3.amazonaws.com/calc5.html http://blog.scalyr.com/2012/10/16/a-systematic-look-at-ec2-io/ http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html#disk-performance
    • RED FLAGS Anti-patterns to watch out for http://grandprix247.com/2012/09/03/spa-pile-up-renews-focus-on-formula-1-safety-matters/
    • PROVISIONED IOPS EBS Ephemeral storage on c1/m1.xlarge or SSD is better. If you must: m*large or c1.xlarge for dedicated NIC http://www.slideshare.net/AmazonWebServices/ebs-mongo-dbwebinarfinal-nn http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html http://navidoo.ru/interest/Nasha_jizn/17676.html
    • AWS DYNAMO DB For high write rates or large/variable keys http://aws.amazon.com/dynamodb/ http://www.walltowall.co.uk/program/standing-tall-worlds-tallest-people_93.aspx
    • HIGH IO/DISK/RAM NODES Use them deliberately http://elledecoration.co.za/2010/07/gigantic/
    • AWS CLOUDWATCH Metric collection, Amazon style. Cost prohibitive & resolution too low http://aws.amazon.com/cloudwatch/ http://www.flickr.com/photos/65683080@N08/6893582132/
    • LOWER COST PER METRIC Use graphite & statsd http://graphite.wikidot.com/ https://github.com/etsy/statsd
    • HOSTED ALTERNATIVES Circonus: All the insights you ever wanted. StackDriver: Optimized for AWS http://circonus.com http://stackdriver.com
    • AWS CLOUDFORMATION Templatize your entire stack. Harder to use as complexity increases http://aws.amazon.com/cloudwatch/ http://fullnfenil7.blogspot.com/2012/05/amazing-cloud-shapes-photos.html#.UhKrZmRgZHg
    • RDS FOR ANALYTICS/REPORTS Paying OLTP prices for BI usage. Sharding will be a matter of time http://nerds.airbnb.com/redshift-performance-cost http://business901.com/blog1/understanding-your-customer-problem/
    • Q & A @jiboumans http://slideshare.net/jiboumans http://vickicaruana.blogspot.com/2011/01/are-you-afraid-to-raise-your-hand.html
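
The FALLBACK PATTERNS and DEFINE 'AVAILABLE' slides state a principle (when a dependency fails, pay for resilience with accuracy or latency, and choose your degraded state in advance) but the transcript does not show an implementation. The following is a minimal Python sketch of one way that idea could look, not the approach Krux actually used. The function name `fetch_with_fallback`, the tier names, and the stand-in backends are all hypothetical; a real setup might put Redis, memcached or Varnish behind the cache tier.

```python
from typing import Callable, Optional, Sequence, Tuple


def fetch_with_fallback(
    sources: Sequence[Tuple[str, Callable[[], Optional[str]]]],
    default: str,
) -> Tuple[str, str]:
    """Try each source in order and return (value, source_name).

    A source counts as failed if it raises or returns None. If every
    source fails, the static default is served, so the caller always
    gets an answer (possibly a less accurate one).
    """
    for name, fetch in sources:
        try:
            value = fetch()
        except Exception:
            # Treat any error as a failed dependency; try the next tier.
            continue
        if value is not None:
            return value, name
    return default, "default"


# Hypothetical stand-ins for real backends (assumptions for illustration).
def primary_db() -> Optional[str]:
    raise ConnectionError("database unreachable")


STALE_CACHE = {"user:42": "segments=a,b"}


def cache_lookup() -> Optional[str]:
    # May be stale: this is the "accuracy" cost of staying available.
    return STALE_CACHE.get("user:42")


if __name__ == "__main__":
    value, source = fetch_with_fallback(
        [("primary", primary_db), ("cache", cache_lookup)],
        default="segments=",
    )
    print(source, value)
```

Ordering the tiers from most accurate to least accurate makes the degraded state explicit: the caller can log or graph which tier answered, and the end user still receives a response while the primary store is down.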