
Devoxx UK: Reliability & Scale in AWS while letting you sleep through the night

Updated version of Reliability & Scale in AWS while letting you sleep through the night
===============================================================

More and more startups and companies are deploying their infrastructure directly and exclusively in EC2 or a similar cloud provider. With that comes a whole new set of challenges and paradigms around scalability, reliability and availability.

This talk focuses on how to leverage the infrastructure parts of AWS and augment them with great (affordable) third-party services and solid Open Source Software. The goal is an operations environment that scales with you, is as reliable as it can be, and provides you and your peers with all the data you need to make good decisions and support (rapid) changes, while letting you sleep through the night. And all that with a tiny operations team.

It may make you coffee in the morning too.

Published in: Technology

Transcript of "Devoxx UK: Reliability & Scale in AWS while letting you sleep through the night"

  1. ONE MAN OPS: Reliability & Scale in AWS while letting you sleep through the night. Jos Boumans - @jiboumans. http://www.fwallpaper.net/picture_pics-Sleepy-cat.html (Tuesday 26 March 13)
  2. RIPE NCC: Engineering manager for RIPE Database. http://www.ripe.net/db
  3. CANONICAL: Engineering manager for Ubuntu Server 10.04 & 10.10. http://lukeroberts.deviantart.com/art/Destroy-Ubuntu-93235775 http://www.ubuntu.com/business/server/overview
  4. KRUX: VP of Operations & Infrastructure. http://www.krux.com/
  5. GOOD GUYS OF DATA PRIVACY
  6. SOME OF OUR CUSTOMERS
  7. LOTS OF TRAFFIC http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html
  8. AVERAGE REQUESTS* / SEC (chart, axis from 0 to 10,000). *Twitter: New tweets; Wikipedia: Articles read; Krux: New data points. https://twitter.com/tps_watcher http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm
  9. MONTHLY UNIQUE USERS (chart, axis from 0 to 600,000,000). http://techcrunch.com/2012/12/18/twitter-passes-200m-monthly-active-users-a-42-increase-over-9-months/ http://technorati.com/technology/article/wikipedias-nonprofit-parent-raises-20-million/
  10. WE CHOSE THE CLOUD http://previewnetworks.com/blog/
  11. THERE ARE DOWNSIDES http://modernsavage.hubpages.com/hub/10-springfield-shopper-headlines
  12. FOCUS ON AWS http://aws.amazon.com/
  13. APRIL 21, 2011 http://aws.amazon.com/message/65648/ http://businessnerds.wordpress.com/2011/05/28/so-far-so-good…-the-review/ http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
  14. ... SOME OUTAGES ... ... SKIPPED FOR BREVITY ...
  15. JUNE 14, 2012 http://www.laczik.org/BMW/repair/E38_wiring_harness/E38_wiring_harness.html http://blog.pagerduty.com/2012/06/outage-post-mortem-june-14/
  16. JUNE 29, 2012 http://www.fanpop.com/spots/thunderstorm/images/25416163/title/thunderstorms-wallpaper http://aws.amazon.com/message/67457/
  17. AWS OUTAGE = YOUR OUTAGE http://it.mario.wikia.com/wiki/Lakitu
  18. THE RULES HAVE CHANGED: You're not in Kansas anymore. http://entreatmenot.blogspot.com/2011/04/shattered-dreams.html
  19. NETWORK WILL PARTITION: And it will happen often. http://thevinylvillain.blogspot.com/2010_04_01_archive.html
  20. DISK IO WILL FLUCTUATE: On a good day, it's mediocre. http://www.freeguidetonwcamping.com/oregon_washington_main/washington/southwest_wa/cape_disappointment_sp.htm
  21. IP ADDRESSES WILL CHANGE: IP lease is 8 hours. DNS TTL is 60 seconds. www.fantom-xp.com
  22. INSTANCES WILL DIE: And it will always be your Database Master. http://room57.deviantart.com/art/Hangman-188353196
  23. HUMANS MAKE MISTAKES: Including your humans.
  24. EMBRACE FAILURE: Hardware will fail. Humans will make errors. Nature will produce thunderstorms. http://www.freeguidetonwcamping.com/oregon_washington_main/washington/southwest_wa/cape_disappointment_sp.htm
  25. OR, COLLOQUIALLY
  26. ADJUST YOUR STRATEGY: Don't bring a knife to a gun fight. http://www.flickr.com/photos/statlerhotel/6628770499/sizes/l/in/photostream/
  27. DATA STORES: Some work better than others. http://gustavhoiland.com/2010/03/10/stacked-boxes/
  28. CAP THEOREM (diagram labels: RDBMS, CouchDB, BigTable Based, Dynamo Based, Master / Slave based). Your choice: sacrifice availability or consistency. Orange is a lie.
  29. MYSQL / ORACLE VS RDS: See: Network partitioning & instances dying.
  30. AMAZON REDSHIFT: Great for analytics/reports, bad for OLTP. Unburden your RDS instances. http://www.flitemedia.com/music.php http://aws.amazon.com/redshift
  31. BIGTABLE BASED STORES: HBase, Accumulo, Hypertable. Still suffer when network partitioning happens. http://www.cloudera.com/cdh4/
  32. DYNAMO BASED STORES: Cassandra, Riak, DynamoDB. http://www.fromoldbooks.org/Walker-ElectricLightingForShips/pages/015-Siemens-Alternate-Current-Dynamo//1552x1175-q75.html http://aws.amazon.com/dynamodb/faqs/
  33. GO HOSTED? CouchDB, MongoDB, Riak, Cassandra, HBase. Your Latency May Vary. http://www.fromoldbooks.org/Walker-ElectricLightingForShips/pages/015-Siemens-Alternate-Current-Dynamo//1552x1175-q75.html
  34. CLIENT SIDE STORAGE: Keep a copy of your users' data locally. http://www.wired.com/gadgetlab/2012/03/badass-gadget-ammo-lunch-box/ http://www.w3.org/2001/tag/2010/09/ClientSideStorage.html
  35. FILE STORES: EBS vs Instance Store ... ... vs RamFS. http://homedezine.blogspot.com/2011/04/day-my-cat-removed-carpet-photo-studio.html
  36. SIMPLE STORAGE SERVICE: S3: Arguably AWS's best feature. http://www.iwallpaper.us/gold-star-fo-christmas-wallpaper-140/
  37. TRAFFIC SHAPING: Control every part of the request. http://www.visualphotos.com/image/2x4154765/man_standing_with_traffic_cones_in_shape_of_u-turn
  38. STAY LOCAL IF YOU CAN: Going off box exposes you to risks you need to mitigate. http://southshorewoman.com/issue/june-2010/article/local-character
  39. CACHE WHAT YOU CAN: HTTP Responses, DB Queries, User content. Browsers have caches too! http://theoatmeal.com/blog/charity_money
  40. USE ELASTIC LOAD BALANCERS: They will save you more than once. http://wallpapers5.com/wallpaper/Balance-Green-Tree-Frog/
  41. USE GLOBAL LOAD BALANCING: Fail over to the closest data center on region failure.
  42. SHOUT OUT: DYN. DNS for Bit.ly, Quora, Twitter, Wikia, etc.
  43. USE A CDN: Critical items should always be available. http://kadanthuponanimidangal.blogspot.com/2010/12/blog-post_6992.html
  44. MEASURE EVERYTHING: Find outliers, deviants & trends before they cause trouble. http://www.themoviedb.org/movie/629-the-usual-suspects
  45. GRAPHITE, STATSD & COLLECTD: Use Statsd & Collectd for application/system metrics. Use graphite to store, aggregate & visualize. http://hostedgraphite.com/ http://bakingismyzen.blogspot.com/2011/07/beignets-cant-have-just-one.html http://jiboumans.wordpress.com/2012/07/02/measure-all-the-things/
  46. GRAPH EVENTS: Deployments, outages, CDN reconfigurations, failed builds, etc. Anything that's important to the health of your ecosystem. http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/
  47. COMPARE WEEK TO WEEK: Overlay week to week graphs using timeShift(). Quickly identifies trends and deviations from trends. http://obfuscurity.com/2012/04/Unhelpful-Graphite-Tip-10
  48. FORECASTING: Use Holt-Winters confidence bands. Verify that your metrics are within normal tolerance. https://github.com/ripienaar/graphite-graph-dsl/wiki/Creating-Holt-Winters-Forecasts
  49. FIND INDIVIDUAL OUTLIERS: Absolute numbers mean very little. Use mean & standard deviation. http://en.wikipedia.org/wiki/File:Black_sheep-1.jpg
  50. ALERT ON TRENDS: Once you go over a threshold, it's too late. Alert on unwanted trends and preemptively fix. http://sub-second.blogspot.com/2012/06/reporting-response-times-percentile.html http://aphyr.github.com/riemann/
  51. MEASURE WITHOUT RETROFIT: LogFormat "http.beacon:%D|ms" stats; CustomLog "|nc -u localhost 8125" stats. http://jiboumans.wordpress.com/2012/07/02/measure-all-the-things/ http://absinthemindedhero.blogspot.com/2012/03/victory-nonetheless.html http://jiboumans.wordpress.com/2013/02/27/realtime-stats-from-varnish/
  52. SHOUT OUT: NEW RELIC. Java, but also Python, Ruby, .NET, PHP & NodeJS support. In depth profiling of your app for performance & errors.
  53. CONFIGURATION MANAGEMENT: Unique snowflakes are bad. http://www.torange.us/Plants/Conifers/spruce-needles-in-hoarfrost-424.html
  54. PUPPET VS CHEF: Yes. http://puppetlabs.com/ http://www.opscode.com/chef
  55. INFRASTRUCTURE AS CODE: Use different environments. Measure and report on it. http://americansingercanary.com/green.htm
  56. SHOUT OUT: UBUNTU. Ubuntu + cloud-init + boto = awesome* (*I am biased). http://www.123rf.com/photo_4871141_food-pyramid-isolated-on-white.html https://github.com/krux/ops-tools
  57. AWS OPSWORKS: Hosted Chef, No extra charge, Ubuntu 12.04 or Amazon Linux. Still rough around the edges. http://thebrandbuilder.files.wordpress.com/2011/08/gordon-01.jpg http://aws.amazon.com/opsworks/
  58. DEV = PRODUCTION: "I dunno, it worked on my laptop". Instead, use Vagrant. http://vagrantup.com/
  59. ROLL YOUR OWN AMIS: Instantly boot up new deployments. Reduce Time to Respond. http://bakingismyzen.blogspot.com/2011/07/beignets-cant-have-just-one.html http://puppetlabs.com/blog/rapid-scaling-with-auto-generated-amis-using-puppet/
  60. CONFIDENT DEPLOYS: That human error could be yours. http://www.etsy.com/listing/37178125/stormtrooper-regrets-those-were-the
  61. CONTINUOUS INTEGRATION: Ours: Github + Jenkins + FPM + apt::s3. From commit to deployable in one command. http://github.com/ http://jenkins-ci.org/ https://github.com/thekad/apt-s3 https://github.com/jordansissel/fpm/wiki/
  62. ONE CLICK DEPLOYMENTS: Deployments should not be exciting. Don't create a checklist; automate & track. https://checkmarkable.com http://www.thegreenhead.com/2012/07/one-click-butter-cutter.php https://github.com/jib/aws-analysis-tools/
  63. DARK LAUNCHES: Exercise the code without impacting the user experience. http://www.kissmetrics.com/ http://www.layoutsparks.com/pictures/moon-23 https://github.com/yahoo/boomerang/
  64. SHADOW TRAFFIC: Test new code against live traffic. http://doppelthingers.tumblr.com/post/12839979386/traffic-light-shadow-hangman-and-possibly-his https://gist.github.com/3125323
  65. SLEEP TIGHT: Slides at: www.Slideshare.net/jiboumans. We're hiring: www.krux.com http://raafay-awan.blogspot.com/2011/08/cats-cutest-of-creatures.html
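
The "Measure without retrofit" slide pipes an Apache log line of the form `http.beacon:%D|ms` to a StatsD listener on UDP port 8125. The same idea can be sketched in Python. This is an illustration under stated assumptions, not the speaker's code: `timer_line` and `send_timer` are hypothetical names, and because Apache's `%D` reports microseconds, the sketch converts to milliseconds before applying the `|ms` suffix.

```python
import socket


def timer_line(name: str, micros: int) -> str:
    """Format a StatsD timer line, converting microseconds to milliseconds."""
    millis = round(micros / 1000)
    return f"{name}:{millis}|ms"


def send_timer(name: str, micros: int, host: str = "localhost", port: int = 8125) -> bytes:
    """Send one timer sample over UDP, fire-and-forget, and return the payload."""
    payload = timer_line(name, micros).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload
```

Because UDP is connectionless, a missing or slow StatsD daemon does not block the sender, which is what makes this kind of instrumentation cheap to add.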
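
The "Find individual outliers" slide recommends mean and standard deviation over absolute numbers. A minimal sketch of that check, assuming one current reading per host, might look like the following. The function name, the host names and the cutoff of two standard deviations are illustrative choices, not values from the talk.

```python
from statistics import mean, pstdev


def find_outliers(readings: dict, threshold: float = 2.0) -> list:
    """Return hosts whose reading is more than `threshold` standard
    deviations away from the fleet mean."""
    values = list(readings.values())
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # every host reports the same value
    return sorted(h for h, v in readings.items() if abs(v - mu) / sigma > threshold)


# Hypothetical response times in ms for eight web hosts.
fleet = {
    "web1": 100, "web2": 102, "web3": 98, "web4": 101,
    "web5": 99, "web6": 103, "web7": 97, "web8": 250,
}
print(find_outliers(fleet))
```

With very small fleets a single host cannot deviate far from a mean it strongly influences, so the cutoff may need lowering when only a handful of hosts are compared.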
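
The "Compare week to week" slide overlays graphs using Graphite's `timeShift()` function. As a rough sketch, a render URL that plots a metric alongside the same metric shifted back seven days could be built as below. The Graphite host, the metric name and the helper name are assumptions for illustration only.

```python
from urllib.parse import urlencode


def week_over_week_url(base_url: str, metric: str, window: str = "-24h") -> str:
    """Build a Graphite render URL with two targets: the metric itself and
    the same metric shifted back by seven days."""
    params = [
        ("target", metric),
        ("target", f'timeShift({metric},"7d")'),
        ("from", window),
        ("format", "json"),
    ]
    return f"{base_url}/render?{urlencode(params)}"


# Hypothetical host and metric name.
url = week_over_week_url("http://graphite.example.com", "stats.timers.http.beacon.mean")
print(url)
```

Requesting both series in one call keeps them on the same time axis, so a deviation from last week's shape is visible at a glance.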
