Rebooting a Cloud



Video of this talk is here:



Usage Rights

© All Rights Reserved

Rebooting a Cloud: Presentation Transcript

  • “Rebooting a Cloud” Jesse Robbins Cofounder, Opscode @jesserobbins http://www.flickr.com/photos/mwichary/2355832413/
  • Jesse Robbins, Cofounder, Opscode, @jesserobbins, jesse@opscode.com
  • “Rebooting a Cloud” http://www.flickr.com/photos/mwichary/2355832413/
  • Opscode is Hiring!
  • Press button, Receive candy! (who here wants candy?) (there is always someone.) http://www.flickr.com/photos/mwichary/2355832413/
  • “Rebooting a Cloud” http://www.flickr.com/photos/mwichary/2355832413/
  • Complexity Increases, Latent Defects Accumulate, & FAILURE HAPPENS!
  • define: The Nines (roughly, downtime per year)
    ‣ 99%: 5,256 min (3.65 days)
    ‣ 99.9%: 526 min (8.8 hours)
    ‣ 99.99%: 53 min
    ‣ 99.999%: 5 min
    ‣ 99.9999%: 32 seconds
    ‣ 99.99999%: 3 seconds
  • 99.9% * 99.9% * 99.9% = 99.7% (oops!)
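The arithmetic on the two slides above can be checked with a short sketch. It assumes a 365-day year and a chain of components that fail independently, so that their availabilities multiply; the function names are illustrative and do not come from the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of downtime per year allowed at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def serial_availability_pct(*component_pcts: float) -> float:
    """Availability of a chain in which every component must be up.

    Assumes failures are independent, so the fractions multiply.
    """
    total = 1.0
    for pct in component_pcts:
        total *= pct / 100
    return total * 100


if __name__ == "__main__":
    for nines in (99, 99.9, 99.99, 99.999):
        print(f"{nines}% -> {downtime_minutes_per_year(nines):.1f} min/year")
    combined = serial_availability_pct(99.9, 99.9, 99.9)
    print(f"three 99.9% tiers in series -> {combined:.2f}%")
```

Under these assumptions, three tiers that each hit "three nines" allow roughly 26 hours of downtime a year between them, compared with under 9 hours for any one tier alone.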
  • Everyone goes through the same thing...
  • Denial: This isn’t happening.
  • (more anger) “#@(* #$#$ %@%!!#$@ #*$#@$ @$*&#($*&@#(*$(”
  • Bargaining: If we do this one thing, it will stop happening.
  • Depression: What’s the point if this is just going to happen?
  • Acceptance: “Failure Happens. We’ll just have to design for it.”
  • Successful companies say:
    ‣ “Failure Happens”
    ‣ “Embrace Failure”
    ‣ “Design For Failure”
    ‣ “Healthy attitude about Failure”
    ‣ “Resilient (to Failure)”
    THE OUTAGES WILL CONTINUE UNTIL THE APPROACH IMPROVES ;-)
  • GameDay (Slide Courtesy of John Allspaw - http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr) http://www.flickr.com/photos/dnorman/2678090600
  • define: GameDay An exercise designed to increase Resilience through large-scale fault injection across critical systems. Part of a larger discipline called Resilience Engineering. Not new, just new to us ;-)
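At toy scale, fault injection can be sketched as a wrapper that makes a dependency fail on purpose, so the fallback path gets exercised before a real outage does it for you. Everything here (`FaultInjector`, `read_profile`, the two stores) is hypothetical and stands in for real services; a GameDay applies the same idea to whole systems rather than single function calls.

```python
import random


class FaultInjector:
    """Wraps a callable and makes a chosen fraction of calls fail."""

    def __init__(self, func, failure_rate, rng=None):
        self.func = func
        self.failure_rate = failure_rate  # 0.0 = never fail, 1.0 = always fail
        self.rng = rng or random.Random()

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return self.func(*args, **kwargs)


def read_profile(primary, fallback, user_id):
    """Try the primary store; fall back to the replica if it fails."""
    try:
        return primary(user_id)
    except ConnectionError:
        return fallback(user_id)


def primary_store(user_id):
    # Hypothetical primary dependency.
    return {"id": user_id, "source": "primary"}


def replica_store(user_id):
    # Hypothetical fallback dependency.
    return {"id": user_id, "source": "replica"}


if __name__ == "__main__":
    broken_primary = FaultInjector(primary_store, failure_rate=1.0)
    print(read_profile(broken_primary, replica_store, 42))
```

Setting `failure_rate` to 1.0 forces every call down the fallback path, which is the point of the exercise: the recovery code runs at a time you chose.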
  • Resilience is a product of People & Technology
  • GameDay increases Resilience in 3 ways
    Preparation
    ‣ Identification and mitigation of risks and impact from failure
    ‣ Reduces frequency of failure (increases MTBF)
    ‣ Reduces duration of recovery (MTTR)
    Participation
    ‣ Builds confidence & competence responding to failure and under stress.
    ‣ Strengthens individual and cultural ability to anticipate, mitigate, respond to, and recover from failures of all types.
    Exercises
    ‣ Trigger and expose “latent defects”
    ‣ Choose when to discover them, instead of letting that be determined by the next real disaster.
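One way to see why both preparation levers matter is the steady-state availability formula, availability = MTBF / (MTBF + MTTR). The formula is standard reliability arithmetic rather than something stated on the slide, and the hour figures below are invented purely for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction of time in service.

    MTBF is treated here as the mean uptime between failures and
    MTTR as the mean time to recover from one.
    """
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    scenarios = [
        ("baseline", 500, 4),
        ("failures half as frequent", 1000, 4),
        ("recovery twice as fast", 500, 2),
    ]
    for label, mtbf, mttr in scenarios:
        print(f"{label}: {availability(mtbf, mttr):.4%}")
```

In this model, halving recovery time buys exactly the same availability as halving the failure rate, which is why practicing recovery is worth as much as preventing failure.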
  • start small... http://www.flickr.com/photos/oakleyoriginals/5674150237
  • increase awareness http://www.flickr.com/photos/maunzy/5099921731
  • build confidence http://www.flickr.com/photos/skevbo/4864249944
  • full scale, live fire exercises http://tacomafiredepartment.blogspot.com/2010/05/west-slope-training-burn.html
  • safety standards & “building codes” http://www.flickr.com/photos/peregrinari/3801964067
  • no substitutes for experience... Failure-free operations require experience with failure. (Photo: © Ana Grillo Photography)
  • Lessons Learned: Every Post-mortem, Ever...
  • Root Cause: “a perfect storm of impossible events”
  • Lesson #1: “we have a bunch of manual processes which we need to automate”
  • (thanks to Hyperbole & A Half)
  • Infrastructure as Code: Enable the reconstruction of the business from nothing but a source code repository, an application data backup, and bare resources.
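The idea of rebuilding from a repository can be sketched as a desired-state description plus a convergence step that is safe to run repeatedly. This is a toy model, not Chef's API: the in-memory `actual` dictionary stands in for a real machine, and the resource names in `DESIRED` are hypothetical.

```python
def converge(desired: dict, actual: dict) -> list:
    """Bring `actual` in line with `desired` and return the changes made.

    Running it again on a converged system makes no changes (idempotent).
    """
    changes = []
    for name, spec in desired.items():
        if actual.get(name) != spec:
            actual[name] = dict(spec)
            changes.append(f"configure {name}")
    # Anything present but not declared is removed.
    for name in sorted(set(actual) - set(desired)):
        del actual[name]
        changes.append(f"remove {name}")
    return changes


# Hypothetical desired state, of the kind kept in the source code repository.
DESIRED = {
    "package:postgresql": {"state": "installed"},
    "service:postgresql": {"state": "running"},
    "file:/etc/app.conf": {"mode": "0644"},
}

if __name__ == "__main__":
    bare_machine = {}  # "bare resources": nothing configured yet
    print(converge(DESIRED, bare_machine))  # first run builds everything
    print(converge(DESIRED, bare_machine))  # second run finds nothing to do
```

Because the description is the source of truth, the same run that builds a bare machine also repairs one that has drifted.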
  • Lesson #1.5: “Golden Images don’t work in dynamic (i.e., Cloud) Environments”
  • Golden Images are not the answer
    • Gold is heavy & expensive
    • Hard to transport
    • Hard to mold
    • Easy to lose configuration detail
    http://www.flickr.com/photos/garysoup/2977173063/
  • When this... (diagram: Varnish, JBoss App, Memcache, Postgres Slaves, Postgres Master)
  • Becomes this... (diagram: Varnish, JBoss App, Memcache, Postgres Slaves, Postgres Master)
  • Change is the only Constant
    (diagram: Varnish, JBoss App, Memcache, Postgres Slaves, Postgres Master)
    • Load balancer config
    • Nagios host ping
    • Nagios host ssh
    • Nagios host HTTP
    • Nagios host app health
    • Graphite CPU
    • Graphite Memory
    • Graphite Disk
    • Graphite SNMP
    • Memcache firewall
    • Postgres firewall
    12+ resource changes for 1 node addition
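The fan-out on this slide can be sketched by deriving every dependent resource from the node itself instead of editing each one by hand. The code below reproduces the eleven resources named on the slide (which counts "12+"); the function name and the wording of each entry are hypothetical.

```python
def derived_resources(node: str) -> list:
    """Resources elsewhere in the system that change when `node` is added.

    The categories mirror the slide; the naming scheme is made up.
    """
    resources = [f"load balancer config: add {node} to pool"]
    resources += [
        f"nagios check {check} for {node}"
        for check in ("ping", "ssh", "http", "app health")
    ]
    resources += [
        f"graphite metric {metric} for {node}"
        for metric in ("cpu", "memory", "disk", "snmp")
    ]
    resources += [
        f"{service} firewall: allow {node}"
        for service in ("memcache", "postgres")
    ]
    return resources


if __name__ == "__main__":
    changes = derived_resources("app3")
    for change in changes:
        print(change)
    print(len(changes), "changes for one new node")
```

A cloned image captures none of these, because they all live on other machines.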
  • Scale & Complexity increase! (diagram: AZ2, DC1, AZ3)
  • CLONING CANNOT COPE WITH THIS http://www.flickr.com/photos/evelynishere/2798236471/
  • (thanks to Hyperbole & A Half)
  • Lesson #2: “We need better Incident Management”
  • http://www.flickr.com/photos/mrwilleeumm/1102333740/sizes/o/in/photostream/
  • ♥ Dev+Ops Culture (Slide Courtesy of John Allspaw - http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr)
  • make it fun!
  • Lesson #3: “(load balancing|website|DNS|database|deployment|storage|provisioning|cloud|etc) failover didn’t work. We need to test & maintain our emergency tools & processes”
  • use it or lose it.
  • Give yourself lots of knobs & levers & use them as part of regular process.
  • I ❤ MTTR
  • (thanks to Hyperbole & A Half)
  • GameDay (Slide Courtesy of John Allspaw - http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr) http://www.flickr.com/photos/dnorman/2678090600
  • Thank You!!! Jesse Robbins, Cofounder, Opscode, @jesserobbins, jesse@opscode.com