Reliability & Scale in AWS while letting you sleep through the night

More and more startups and companies are deploying their infrastructure directly and exclusively in EC2 or a similar cloud provider. With that comes a whole new set of challenges and paradigms around scalability, reliability and availability.

This talk will focus on how to leverage all the infrastructure parts of AWS and augment them with great (affordable) third-party services and solid open source software. The goal is an operations environment that scales with you, is as reliable as it can be, and provides you and your peers with all the data you need to make good decisions and support (rapid) changes while letting you sleep through the night. And all that using a tiny operations team.

It may make you coffee in the morning too.




Usage Rights

© All Rights Reserved

Reliability & Scale in AWS while letting you sleep through the night Presentation Transcript

  • 1. ONE MAN OPS Reliability & Scale in AWS while letting you sleep through the night Jos Boumans - @jiboumans http://www.fwallpaper.net/picture_pics-Sleepy-cat.html
  • 2. ONE OF A KIND My own category
  • 3. RIPE NCC Engineering manager for RIPE Database http://www.ripe.net/db
  • 4. CANONICAL Engineering manager for Ubuntu Server 10.04 & 10.10 http://lukeroberts.deviantart.com/art/Destroy-Ubuntu-93235775 http://www.ubuntu.com/business/server/overview
  • 5. KRUX VP of Operations & Infrastructure http://www.krux.com/
  • 7. LOTS OF TRAFFIC http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html
  • 8. AVERAGE REQUESTS* / SEC (chart, axis from 0 to 10,000) *Twitter: New tweets; Wikipedia: Articles read; Krux: New data points https://twitter.com/tps_watcher http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm
  • 9. MONTHLY UNIQUE USERS (chart, axis from 0 to 500,000,000) http://www.mediabistro.com/alltwitter/twitter-active-total-users_b17655 http://technorati.com/technology/article/wikipedias-nonprofit-parent-raises-20-million/
  • 10. WE CHOSE THE CLOUD http://previewnetworks.com/blog/
  • 11. THERE ARE DOWNSIDES http://modernsavage.hubpages.com/hub/10-springfield-shopper-headlines
  • 12. FOCUS ON AWS http://aws.amazon.com/
  • 13. APRIL 21, 2011 http://aws.amazon.com/message/65648/ http://businessnerds.wordpress.com/2011/05/28/so-far-so-good…-the-review/ http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
  • 15. JUNE 14, 2012 http://www.laczik.org/BMW/repair/E38_wiring_harness/E38_wiring_harness.html http://blog.pagerduty.com/2012/06/outage-post-mortem-june-14/
  • 16. JUNE 29, 2012 http://www.fanpop.com/spots/thunderstorm/images/25416163/title/thunderstorms-wallpaper http://aws.amazon.com/message/67457/
  • 17. AWS OUTAGE = YOUR OUTAGE http://it.mario.wikia.com/wiki/Lakitu
  • 18. THE RULES HAVE CHANGED You're not in Kansas anymore http://entreatmenot.blogspot.com/2011/04/shattered-dreams.html
  • 19. NETWORK WILL PARTITION And it will happen often http://thevinylvillain.blogspot.com/2010_04_01_archive.html
  • 20. DISK IO WILL FLUCTUATE On a good day, it's mediocre http://www.freeguidetonwcamping.com/oregon_washington_main/washington/southwest_wa/cape_disappointment_sp.htm
  • 21. IP ADDRESSES WILL CHANGE IP lease is 8 hours. DNS TTL is 60 seconds. www.fantom-xp.com
  • 22. INSTANCES WILL DIE And it will always be your Database Master http://room57.deviantart.com/art/Hangman-188353196
  • 23. HUMANS MAKE MISTAKES Including your humans
  • 24. EMBRACE FAILURE Hardware will fail. Humans will make errors. Nature will produce thunderstorms. http://www.freeguidetonwcamping.com/oregon_washington_main/washington/southwest_wa/cape_disappointment_sp.htm
  • 25. ADJUST YOUR STRATEGY Don't bring a knife to a gun fight http://www.flickr.com/photos/statlerhotel/6628770499/sizes/l/in/photostream/
  • 26. DATA STORES Some work better than others http://gustavhoiland.com/2010/03/10/stacked-boxes/
  • 27. CAP THEOREM Your choice: sacrifice availability or consistency. Orange is a lie. (Diagram labels: RDBMS, CouchDB, BigTable based, Dynamo based, Master / Slave based)
  • 28. MYSQL / ORACLE VS RDS See: Network partitioning & instances dying
  • 29. BIGTABLE BASED STORES HBase, Accumulo, Hypertable. Still suffer when network partitioning happens. http://www.cloudera.com/cdh4/
  • 30. DYNAMO BASED STORES Cassandra, Riak, DynamoDB http://www.fromoldbooks.org/Walker-ElectricLightingForShips/pages/015-Siemens-Alternate-Current-Dynamo//1552x1175-q75.html http://aws.amazon.com/dynamodb/faqs/
  • 31. GO HOSTED? CouchDB, MongoDB, Riak, Cassandra, HBase. Your Latency May Vary http://www.fromoldbooks.org/Walker-ElectricLightingForShips/pages/015-Siemens-Alternate-Current-Dynamo//1552x1175-q75.html
  • 32. CLIENT SIDE STORAGE Keep a copy of your users' data locally http://www.wired.com/gadgetlab/2012/03/badass-gadget-ammo-lunch-box/ http://www.w3.org/2001/tag/2010/09/ClientSideStorage.html
  • 33. FILE STORES EBS vs Instance Store http://homedezine.blogspot.com/2011/04/day-my-cat-removed-carpet-photo-studio.html
  • 34. SIMPLE STORAGE SERVICE S3: Arguably AWS' best feature http://www.iwallpaper.us/gold-star-fo-christmas-wallpaper-140/
  • 35. TRAFFIC SHAPING Control every part of the request http://www.visualphotos.com/image/2x4154765/man_standing_with_traffic_cones_in_shape_of_u-turn
  • 36. STAY LOCAL IF YOU CAN Going off box exposes you to risks you need to mitigate http://southshorewoman.com/issue/june-2010/article/local-character
  • 37. CACHE WHAT YOU CAN HTTP Responses, DB Queries, User content. Browsers have caches too! http://theoatmeal.com/blog/charity_money
  • 38. USE ELASTIC LOAD BALANCERS They will save you more than once http://wallpapers5.com/wallpaper/Balance-Green-Tree-Frog/
  • 39. USE GLOBAL LOAD BALANCING Fail over to the closest data center on region failure
  • 40. SHOUT OUT: DYNDNS for Bit.ly, Quora, Twitter, Wikia, etc
  • 41. USE A CDN Critical items should always be available http://kadanthuponanimidangal.blogspot.com/2010/12/blog-post_6992.html
  • 42. MEASURE EVERYTHING Find outliers, deviants & trends before they cause trouble http://www.themoviedb.org/movie/629-the-usual-suspects
  • 43. GRAPHITE, STATSD & COLLECTD Use Statsd & Collectd for application/system metrics. Use Graphite to store, aggregate & visualize. http://hostedgraphite.com/ http://bakingismyzen.blogspot.com/2011/07/beignets-cant-have-just-one.html http://jiboumans.wordpress.com/2012/07/02/measure-all-the-things/
  • 44. GRAPH EVENTS Deployments, outages, CDN reconfigurations, failed builds, etc. Anything that's important to the health of your ecosystem http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/
  • 45. COMPARE WEEK TO WEEK Overlay week to week graphs using timeShift(). Quickly identifies trends and deviations from trends. http://obfuscurity.com/2012/04/Unhelpful-Graphite-Tip-10
  • 46. FORECASTING Use Holt-Winters confidence bands. Verify that your metrics are within normal tolerance. https://github.com/ripienaar/graphite-graph-dsl/wiki/Creating-Holt-Winters-Forecasts
  • 47. FIND INDIVIDUAL OUTLIERS Absolute numbers mean very little. Use mean & standard deviation. http://en.wikipedia.org/wiki/File:Black_sheep-1.jpg
  • 48. ALERT ON TRENDS Once you go over a threshold, it's too late. Alert on unwanted trends and preemptively fix. http://sub-second.blogspot.com/2012/06/reporting-response-times-percentile.html http://aphyr.github.com/riemann/
  • 49. MEASURE WITHOUT RETROFIT
    LogFormat "http.beacon:%D|ms" stats
    CustomLog "|nc -u localhost 8125" stats
    http://absinthemindedhero.blogspot.com/2012/03/victory-nonetheless.html http://jiboumans.wordpress.com/2012/07/02/measure-all-the-things/
  • 50. SHOUT OUT: NEW RELIC Python, Ruby, .NET, Java, PHP support. In-depth profiling of your app for performance & errors.
  • 51. CONFIGURATION MANAGEMENT Unique snowflakes are bad http://www.torange.us/Plants/Conifers/spruce-needles-in-hoarfrost-424.html
  • 52. PUPPET VS CHEF Yes. http://puppetlabs.com/ http://www.opscode.com/chef
  • 53. INFRASTRUCTURE AS CODE Use different environments. Measure and report on it. http://americansingercanary.com/green.htm
  • 54. SHOUT OUT: UBUNTU Ubuntu + cloud-init + boto = awesome* *I am biased http://www.123rf.com/photo_4871141_food-pyramid-isolated-on-white.html https://github.com/krux/ops-tools
  • 55. DEV = PRODUCTION "I dunno, it worked on my laptop" Instead, use Vagrant http://vagrantup.com/
  • 56. ROLL YOUR OWN AMIS Instantly boot up new deployments. Reduce Time to Respond. http://bakingismyzen.blogspot.com/2011/07/beignets-cant-have-just-one.html http://puppetlabs.com/blog/rapid-scaling-with-auto-generated-amis-using-puppet/
  • 57. CONFIDENT DEPLOYS That human error could be yours http://www.etsy.com/listing/37178125/stormtrooper-regrets-those-were-the
  • 58. CONTINUOUS INTEGRATION Ours: Github + Jenkins + FPM + apt::s3. From commit to deployable in one command. http://github.com/ http://jenkins-ci.org/ https://github.com/thekad/apt-s3 https://github.com/jordansissel/fpm/wiki/
  • 59. ONE CLICK DEPLOYMENTS Deployments should not be exciting. Don't create a checklist; automate & track. http://www.thegreenhead.com/2012/07/one-click-butter-cutter.php https://checkmarkable.com/
  • 60. DARK LAUNCHES Exercise the code without impacting the user experience http://www.kissmetrics.com/ http://www.layoutsparks.com/pictures/moon-23 https://github.com/yahoo/boomerang/
  • 61. SHADOW TRAFFIC Test new code against live traffic http://doppelthingers.tumblr.com/post/12839979386/traffic-light-shadow-hangman-and-possibly-his https://gist.github.com/3125323
  • 62. SLEEP TIGHT Slides at: www.Slideshare.net/jiboumans We're hiring: www.krux.com http://raafay-awan.blogspot.com/2011/08/cats-cutest-of-creatures.html
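
Slide 37 (CACHE WHAT YOU CAN) names DB queries as one thing worth caching. A minimal sketch of that idea is a small in-process cache whose entries expire after a fixed time. The class name `TTLCache`, its method names and the injectable clock are illustrative assumptions, not something the talk prescribes.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-process cache; each entry expires ttl_seconds after it is stored."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        # The clock is injectable so expiry can be exercised without sleeping.
        self.clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        """Return the cached value for key, calling compute() on a miss or expiry."""
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = compute()
        self._entries[key] = (now + self.ttl_seconds, value)
        return value
```

Wrapping an expensive query as `cache.get_or_compute("top_users", run_query)` (both names hypothetical) means the database is hit at most once per TTL window for that key.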
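
Slide 45 (COMPARE WEEK TO WEEK) suggests overlaying graphs with Graphite's `timeShift()`. The sketch below only builds a render URL that requests a metric next to the same metric shifted by seven days. The base URL and metric name in the usage note are placeholders, and the helper name `week_over_week_url` is an assumption made for illustration.

```python
from urllib.parse import urlencode


def week_over_week_url(graphite_base: str, metric: str, window: str = "-1d") -> str:
    """Build a Graphite /render URL overlaying a metric with itself one week earlier."""
    targets = [
        metric,                                  # this week's values
        'timeShift({},"7d")'.format(metric),     # same series, shifted back 7 days
    ]
    query = [("target", t) for t in targets]
    query += [("from", window), ("format", "json")]
    return graphite_base.rstrip("/") + "/render?" + urlencode(query)
```

For example, `week_over_week_url("http://graphite.example.com", "stats.timers.http.beacon.mean")` yields a URL whose two targets can be drawn on one graph, so a deviation from last week's shape stands out.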
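
Slide 46 (FORECASTING) recommends checking that metrics stay within Holt-Winters confidence bands. Graphite can produce such bands with its `holtWintersConfidenceBands()` function. The sketch below assumes the observed series and the lower and upper band series have already been fetched as equally long lists, and only performs the tolerance check. The function name `outside_bands` is hypothetical.

```python
from typing import List, Optional, Sequence


def outside_bands(observed: Sequence[Optional[float]],
                  lower: Sequence[Optional[float]],
                  upper: Sequence[Optional[float]]) -> List[int]:
    """Return the indexes of datapoints that fall outside the confidence bands."""
    violations = []
    for i, (value, low, high) in enumerate(zip(observed, lower, upper)):
        # Gaps in a series show up as None; skip them rather than alert on them.
        if value is None or low is None or high is None:
            continue
        if value < low or value > high:
            violations.append(i)
    return violations
```

A non-empty result is the signal that a metric has left its normal tolerance and deserves a look.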
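
Slide 47 (FIND INDIVIDUAL OUTLIERS) argues for mean and standard deviation over absolute numbers. One way to sketch this, assuming a single reading per host and a cut-off of two standard deviations (both assumptions chosen for illustration), is:

```python
from statistics import mean, pstdev
from typing import Dict


def find_outliers(values_by_host: Dict[str, float],
                  max_sigma: float = 2.0) -> Dict[str, float]:
    """Return hosts whose value is more than max_sigma standard deviations from the mean.

    The result maps each outlying host to its signed distance in standard deviations.
    """
    values = list(values_by_host.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return {}  # every host reports the same value, so nothing stands out
    return {
        host: (value - mu) / sigma
        for host, value in values_by_host.items()
        if abs(value - mu) / sigma > max_sigma
    }
```

With nine hosts reporting around 100 ms and one reporting 250 ms, only the slow host is returned, whatever the absolute latency of the fleet happens to be.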
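
Slide 48 (ALERT ON TRENDS) says that alerting once a threshold is crossed is too late. A simple way to illustrate trend-based alerting, assuming a plain least-squares line is a good enough fit for the recent samples, is to estimate how long it will take for a metric to reach its threshold. The function name and the sample format are assumptions for this sketch.

```python
from typing import Optional, Sequence, Tuple


def seconds_until_threshold(samples: Sequence[Tuple[float, float]],
                            threshold: float) -> Optional[float]:
    """Estimate seconds until the fitted trend line reaches threshold.

    samples are (unix_timestamp, value) pairs, oldest first.
    Returns 0.0 if the trend is already at or above the threshold, and
    None if there is too little data or the trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    spread = sum((t - mean_t) ** 2 for t, _ in samples)
    if spread == 0:
        return None  # all samples share one timestamp; no slope can be fitted
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / spread
    intercept = mean_v - slope * mean_t
    current = slope * samples[-1][0] + intercept  # fitted value at the latest sample
    if current >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - current) / slope
```

An alert rule could then page when the estimate drops below, say, a few hours for disk usage, which leaves time to fix the problem before the threshold is actually hit.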
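
Slide 49 (MEASURE WITHOUT RETROFIT) pipes Apache log lines of the form `http.beacon:<value>|ms` to a StatsD listener on UDP port 8125. For code that cannot be instrumented through a log format, the same datagram can be sent directly. The sketch below assumes the duration is available in microseconds (which is the unit Apache's `%D` reports) and converts it to whole milliseconds. The function names are illustrative.

```python
import socket


def statsd_timing(name: str, micros: float) -> bytes:
    """Format a StatsD timer datagram, converting microseconds to milliseconds."""
    millis = int(round(micros / 1000.0))
    return "{}:{}|ms".format(name, millis).encode("ascii")


def send_metric(payload: bytes, host: str = "localhost", port: int = 8125) -> None:
    """Fire-and-forget a datagram at a StatsD listener over UDP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()
```

Calling `send_metric(statsd_timing("http.beacon", 12400))` emits `http.beacon:12|ms`. Because UDP is connectionless, the application is not blocked if the metrics daemon is slow or down.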
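
Slide 61 (SHADOW TRAFFIC) describes testing new code against live traffic. The sketch below is not taken from the gist linked on that slide. It is one possible illustration, assuming requests and responses are plain dictionaries and handlers are ordinary callables: the live request is copied to a shadow handler in a background thread, and only the primary handler's response ever reaches the user.

```python
import threading
from typing import Callable, Dict

Handler = Callable[[Dict], Dict]


def serve_with_shadow(request: Dict, primary: Handler, shadow: Handler) -> Dict:
    """Answer from the primary handler while replaying the request to a shadow."""

    def run_shadow() -> None:
        try:
            # Shallow copy, so the shadow cannot replace keys on the live request.
            shadow(dict(request))
        except Exception:
            # Shadow failures must never affect the user-facing response.
            pass

    worker = threading.Thread(target=run_shadow, daemon=True)
    worker.start()
    return primary(request)
```

In practice the shadow handler would also record its own latency and errors, so the new code's behaviour under real traffic can be graphed before it serves any user.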