Reliability & Scale in AWS while letting you sleep through the night

More and more startups and companies are deploying their infrastructure directly and exclusively in EC2 or a similar cloud provider. With that comes a whole new set of challenges and paradigms around scalability, reliability and availability.

This talk will focus on how to leverage all the infrastructure parts of AWS and augment them with great (affordable) third-party services and solid open source software to create an operations environment that will scale with you, be as reliable as it can be, and provide you and your peers with all the data you need to make good decisions and support (rapid) changes, while letting you sleep through the night. And all that with a tiny operations team.

It may make you coffee in the morning too.

Published in: Technology, Business

Transcript of "Reliability & Scale in AWS while letting you sleep through the night"

  1. ONE MAN OPS Reliability & Scale in AWS while letting you sleep through the night. Jos Boumans - @jiboumans http://www.fwallpaper.net/picture_pics-Sleepy-cat.html
  2. ONE OF A KIND My own category
  3. RIPE NCC Engineering manager for RIPE Database http://www.ripe.net/db
  4. CANONICAL Engineering manager for Ubuntu Server 10.04 & 10.10 http://lukeroberts.deviantart.com/art/Destroy-Ubuntu-93235775 http://www.ubuntu.com/business/server/overview
  5. KRUX VP of Operations & Infrastructure http://www.krux.com/
  6. GOOD GUYS OF DATA PRIVACY
  7. LOTS OF TRAFFIC http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html
  8. AVERAGE REQUESTS* / SEC (chart, axis 0 to 10,000) *Twitter: New tweets. Wikipedia: Articles read. Krux: New data points. https://twitter.com/tps_watcher http://stats.wikimedia.org/EN/TablesPageViewsMonthlyCombined.htm
  9. MONTHLY UNIQUE USERS (chart, axis 0 to 500,000,000) http://www.mediabistro.com/alltwitter/twitter-active-total-users_b17655 http://technorati.com/technology/article/wikipedias-nonprofit-parent-raises-20-million/
  10. WE CHOSE THE CLOUD http://previewnetworks.com/blog/
  11. THERE ARE DOWNSIDES http://modernsavage.hubpages.com/hub/10-springfield-shopper-headlines
  12. FOCUS ON AWS http://aws.amazon.com/
  13. APRIL 21, 2011 http://aws.amazon.com/message/65648/ http://businessnerds.wordpress.com/2011/05/28/so-far-so-good…-the-review/ http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
  14. ... SOME OUTAGES ... ... SKIPPED FOR BREVITY ...
  15. JUNE 14, 2012 http://www.laczik.org/BMW/repair/E38_wiring_harness/E38_wiring_harness.html http://blog.pagerduty.com/2012/06/outage-post-mortem-june-14/
  16. JUNE 29, 2012 http://www.fanpop.com/spots/thunderstorm/images/25416163/title/thunderstorms-wallpaper http://aws.amazon.com/message/67457/
  17. AWS OUTAGE = YOUR OUTAGE http://it.mario.wikia.com/wiki/Lakitu
  18. THE RULES HAVE CHANGED You're not in Kansas anymore http://entreatmenot.blogspot.com/2011/04/shattered-dreams.html
  19. NETWORK WILL PARTITION And it will happen often http://thevinylvillain.blogspot.com/2010_04_01_archive.html
  20. DISK IO WILL FLUCTUATE On a good day, it's mediocre http://www.freeguidetonwcamping.com/oregon_washington_main/washington/southwest_wa/cape_disappointment_sp.htm
  21. IP ADDRESSES WILL CHANGE IP lease is 8 hours. DNS TTL is 60 seconds. www.fantom-xp.com
  22. INSTANCES WILL DIE And it will always be your Database Master http://room57.deviantart.com/art/Hangman-188353196
  23. HUMANS MAKE MISTAKES Including your humans
  24. EMBRACE FAILURE Hardware will fail. Humans will make errors. Nature will produce thunderstorms. http://www.freeguidetonwcamping.com/oregon_washington_main/washington/southwest_wa/cape_disappointment_sp.htm
  25. ADJUST YOUR STRATEGY Don't bring a knife to a gun fight http://www.flickr.com/photos/statlerhotel/6628770499/sizes/l/in/photostream/
  26. DATA STORES Some work better than others http://gustavhoiland.com/2010/03/10/stacked-boxes/
  27. CAP THEOREM (diagram labels: RDBMS, CouchDB, BigTable Based, Dynamo Based, Master / Slave based) Your choice: sacrifice availability or consistency. Orange is a lie.
  28. MYSQL / ORACLE VS RDS See: Network partitioning & instances dying
  29. BIGTABLE BASED STORES HBase, Accumulo, Hypertable. Still suffer when network partitioning happens. http://www.cloudera.com/cdh4/
  30. DYNAMO BASED STORES Cassandra, Riak, DynamoDB http://www.fromoldbooks.org/Walker-ElectricLightingForShips/pages/015-Siemens-Alternate-Current-Dynamo//1552x1175-q75.html http://aws.amazon.com/dynamodb/faqs/
  31. GO HOSTED? CouchDB, MongoDB, Riak, Cassandra, HBase. Your Latency May Vary. http://www.fromoldbooks.org/Walker-ElectricLightingForShips/pages/015-Siemens-Alternate-Current-Dynamo//1552x1175-q75.html
  32. CLIENT SIDE STORAGE Keep a copy of your users' data locally http://www.wired.com/gadgetlab/2012/03/badass-gadget-ammo-lunch-box/ http://www.w3.org/2001/tag/2010/09/ClientSideStorage.html
  33. FILE STORES EBS vs Instance Store http://homedezine.blogspot.com/2011/04/day-my-cat-removed-carpet-photo-studio.html
  34. SIMPLE STORAGE SERVICE S3: Arguably AWS's best feature http://www.iwallpaper.us/gold-star-fo-christmas-wallpaper-140/
  35. TRAFFIC SHAPING Control every part of the request http://www.visualphotos.com/image/2x4154765/man_standing_with_traffic_cones_in_shape_of_u-turn
  36. STAY LOCAL IF YOU CAN Going off box exposes you to risks you need to mitigate http://southshorewoman.com/issue/june-2010/article/local-character
  37. CACHE WHAT YOU CAN HTTP Responses, DB Queries, User content. Browsers have caches too! http://theoatmeal.com/blog/charity_money
  38. USE ELASTIC LOAD BALANCERS They will save you more than once http://wallpapers5.com/wallpaper/Balance-Green-Tree-Frog/
  39. USE GLOBAL LOAD BALANCING Fail over to the closest data center on region failure
  40. SHOUT OUT: DYNDNS for Bit.ly, Quora, Twitter, Wikia, etc.
  41. USE A CDN Critical items should always be available http://kadanthuponanimidangal.blogspot.com/2010/12/blog-post_6992.html
  42. MEASURE EVERYTHING Find outliers, deviants & trends before they cause trouble http://www.themoviedb.org/movie/629-the-usual-suspects
  43. GRAPHITE, STATSD & COLLECTD Use Statsd & Collectd for application/system metrics. Use Graphite to store, aggregate & visualize. http://hostedgraphite.com/ http://bakingismyzen.blogspot.com/2011/07/beignets-cant-have-just-one.html http://jiboumans.wordpress.com/2012/07/02/measure-all-the-things/
  44. GRAPH EVENTS Deployments, outages, CDN reconfigurations, failed builds, etc. Anything that's important to the health of your ecosystem. http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/
  45. COMPARE WEEK TO WEEK Overlay week to week graphs using timeShift(). Quickly identifies trends and deviations from trends. http://obfuscurity.com/2012/04/Unhelpful-Graphite-Tip-10
  46. FORECASTING Use Holt-Winters confidence bands. Verify that your metrics are within normal tolerance. https://github.com/ripienaar/graphite-graph-dsl/wiki/Creating-Holt-Winters-Forecasts
  47. FIND INDIVIDUAL OUTLIERS Absolute numbers mean very little. Use mean & standard deviation. http://en.wikipedia.org/wiki/File:Black_sheep-1.jpg
  48. ALERT ON TRENDS Once you go over a threshold, it's too late. Alert on unwanted trends and preemptively fix. http://sub-second.blogspot.com/2012/06/reporting-response-times-percentile.html http://aphyr.github.com/riemann/
  49. MEASURE WITHOUT RETROFIT LogFormat "http.beacon:%D|ms" stats CustomLog "|nc -u localhost 8125" stats http://absinthemindedhero.blogspot.com/2012/03/victory-nonetheless.html http://jiboumans.wordpress.com/2012/07/02/measure-all-the-things/
  50. SHOUT OUT: NEW RELIC Python, Ruby, .NET, Java, PHP support. In-depth profiling of your app for performance & errors.
  51. CONFIGURATION MANAGEMENT Unique snowflakes are bad http://www.torange.us/Plants/Conifers/spruce-needles-in-hoarfrost-424.html
  52. PUPPET VS CHEF Yes. http://puppetlabs.com/ http://www.opscode.com/chef
  53. INFRASTRUCTURE AS CODE Use different environments. Measure and report on it. http://americansingercanary.com/green.htm
  54. SHOUT OUT: UBUNTU Ubuntu + cloud-init + boto = awesome* *I am biased http://www.123rf.com/photo_4871141_food-pyramid-isolated-on-white.html https://github.com/krux/ops-tools
  55. DEV = PRODUCTION "I dunno, it worked on my laptop" Instead, use Vagrant http://vagrantup.com/
  56. ROLL YOUR OWN AMIS Instantly boot up new deployments. Reduce Time to Respond. http://bakingismyzen.blogspot.com/2011/07/beignets-cant-have-just-one.html http://puppetlabs.com/blog/rapid-scaling-with-auto-generated-amis-using-puppet/
  57. CONFIDENT DEPLOYS That human error could be yours http://www.etsy.com/listing/37178125/stormtrooper-regrets-those-were-the
  58. CONTINUOUS INTEGRATION Ours: Github + Jenkins + FPM + apt::s3. From commit to deployable in one command. http://github.com/ http://jenkins-ci.org/ https://github.com/thekad/apt-s3 https://github.com/jordansissel/fpm/wiki/
  59. ONE CLICK DEPLOYMENTS Deployments should not be exciting. Don't create a checklist; automate & track. http://www.thegreenhead.com/2012/07/one-click-butter-cutter.php https://checkmarkable.com/
  60. DARK LAUNCHES Exercise the code without impacting the user experience http://www.kissmetrics.com/ http://www.layoutsparks.com/pictures/moon-23 https://github.com/yahoo/boomerang/
  61. SHADOW TRAFFIC Test new code against live traffic http://doppelthingers.tumblr.com/post/12839979386/traffic-light-shadow-hangman-and-possibly-his https://gist.github.com/3125323
  62. SLEEP TIGHT Slides at: www.Slideshare.net/jiboumans We're hiring: www.krux.com http://raafay-awan.blogspot.com/2011/08/cats-cutest-of-creatures.html
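
Slide 47 suggests judging individual machines by mean and standard deviation instead of absolute numbers. The following is a minimal sketch of that idea, not code from the talk: the function name `find_outliers`, the host names, the latency figures and the cutoff of 2 standard deviations are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def find_outliers(values_by_host, threshold=2.0):
    """Return {host: z_score} for hosts further than `threshold`
    population standard deviations from the fleet mean."""
    values = list(values_by_host.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Every host reports the same value, so nothing stands out.
        return {}
    return {
        host: (value - mu) / sigma
        for host, value in values_by_host.items()
        if abs(value - mu) > threshold * sigma
    }


# Hypothetical per-host response times in milliseconds.
latencies_ms = {
    "web-1": 102, "web-2": 98, "web-3": 101,
    "web-4": 99, "web-5": 100, "web-6": 180,
}
outliers = find_outliers(latencies_ms)
print(sorted(outliers))
```

A single extreme host also inflates the standard deviation it is measured against, so on small fleets a cutoff of 3 may never trigger; the right cutoff depends on fleet size and how noisy the metric is.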

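Slide 49 pipes an Apache log line shaped like `http.beacon:%D|ms` to UDP port 8125 with `nc`. For applications that can be changed, the same StatsD timing packet can be built and sent directly. This is a hedged sketch: the helper names `format_timing` and `send_timing` are hypothetical, and the conversion to milliseconds is an assumption based on Apache documenting `%D` as microseconds while StatsD timers use the `|ms` type.

```python
import socket


def format_timing(metric, micros):
    """Build a StatsD timing packet ("name:value|ms").

    The input is in microseconds (the unit Apache's %D reports),
    so it is converted to milliseconds to match the |ms suffix.
    """
    return f"{metric}:{micros / 1000:.3f}|ms".encode("ascii")


def send_timing(metric, micros, host="localhost", port=8125):
    """Fire-and-forget the packet over UDP, like `nc -u localhost 8125`."""
    packet = format_timing(metric, micros)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))
    return packet


# Example: a request that took 15,250 microseconds.
packet = format_timing("http.beacon", 15250)
print(packet)
```

Calling `send_timing("http.beacon", 15250)` would emit the same packet to a StatsD daemon on the local machine. UDP is used so that a missing or slow metrics daemon never blocks the request path.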
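
Slide 48 argues for alerting on trends before a threshold is crossed. One simple way to sketch that, assuming a plain least-squares line is a good enough model for the metric, is to project when the threshold will be reached and alert on the remaining time. The names `linear_fit` and `hours_until`, the disk usage samples, the 90% threshold and the 12 hour warning window are all illustrative assumptions, not values from the talk.

```python
def linear_fit(samples):
    """Least-squares slope and intercept for (hour, value) samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("need at least two distinct timestamps")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    return slope, mean_v - slope * mean_t


def hours_until(samples, threshold):
    """Projected hours from the last sample until `threshold` is reached.

    Returns None when the fitted trend is flat or falling.
    """
    slope, intercept = linear_fit(samples)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - samples[-1][0])


# Hypothetical hourly disk usage in percent.
disk_used_pct = [(0, 60.0), (1, 62.0), (2, 64.0), (3, 66.0), (4, 68.0)]
eta = hours_until(disk_used_pct, threshold=90.0)
should_alert = eta is not None and eta < 12
print(eta, should_alert)
```

A straight line ignores daily and weekly cycles, which is where the Holt-Winters bands from slide 46 would be a better fit.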