
Rebooting a Cloud



Video of this talk is here:





Usage Rights

© All Rights Reserved


    Rebooting a Cloud: Presentation Transcript

    • “Rebooting a Cloud” Jesse Robbins Cofounder, Opscode @jesserobbins http://www.flickr.com/photos/mwichary/2355832413/
    • Jesse Robbins, Cofounder, Opscode, @jesserobbins, jesse@opscode.com
    • “Rebooting a Cloud” http://www.flickr.com/photos/mwichary/2355832413/
    • Opscode is Hiring!
    • Press button, Receive candy! (who here wants candy?) (there is always someone.) http://www.flickr.com/photos/mwichary/2355832413/
    • “Rebooting a Cloud” http://www.flickr.com/photos/mwichary/2355832413/
    • Complexity Increases, Latent Defects Accumulate, & FAILURE HAPPENS!
    • define: The Nines (roughly; allowed downtime per year)
      ‣ 99%: 5256 min (about 3.65 days)
      ‣ 99.9%: 526 min (8.8 hours)
      ‣ 99.99%: 53 min
      ‣ 99.999%: 5 min
      ‣ 99.9999%: 30 seconds
      ‣ 99.99999%: 3 seconds
    • 99.9% * 99.9% * 99.9% = 99.7% (oops!)
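
The arithmetic on the two slides above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, a year is taken as 365 days, and the chained figure assumes the three tiers fail independently and that all of them must be up for the service to be up.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Allowed downtime per year, in minutes, for a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def serial_availability(*component_percents: float) -> float:
    """Availability (in percent) of a chain where every component must be up.

    Assumes the components fail independently of each other.
    """
    combined = 1.0
    for percent in component_percents:
        combined *= percent / 100
    return combined * 100


if __name__ == "__main__":
    for nines in (99, 99.9, 99.99, 99.999):
        print(f"{nines}% -> {downtime_minutes_per_year(nines):.1f} min/year")
    print(f"three 99.9% tiers -> {serial_availability(99.9, 99.9, 99.9):.1f}%")
```

Stacking tiers multiplies availabilities, so every dependency added in series spends some of the downtime budget.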
    • Everyone goes through the same thing...
    • Denial: This isn’t happening.
    • (more anger) “#@(* #$#$ %@%!!#$@ #*$#@$ @$*&#($*&@#(*$(”
    • Bargaining: If we do this one thing, it will stop happening.
    • Depression: What’s the point if this is just going to happen?
    • Acceptance: “Failure Happens. We’ll just have to design for it.”
    • Successful companies say: “Failure Happens” “Embrace Failure” “Design For Failure” “Healthy attitude about Failure” “Resilient (to Failure)” THE OUTAGES WILL CONTINUE UNTIL THE APPROACH IMPROVES ;-)
    • GameDay. Slide Courtesy of John Allspaw - http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr http://www.flickr.com/photos/dnorman/2678090600
    • define: GameDay An exercise designed to increase Resilience through large-scale fault injection across critical systems. Part of a larger discipline called Resilience Engineering. Not new, just new to us ;-)
    • Resilience is a product of People & Technology
    • GameDay increases Resilience in 3 ways
      Preparation
      ‣ Identification and mitigation of risks and impact from failure
      ‣ Reduces frequency of failure (MTBF)
      ‣ Reduces duration of recovery (MTTR)
      Participation
      ‣ Builds confidence & competence responding to failure and under stress.
      ‣ Strengthens individual and cultural ability to anticipate, mitigate, respond to, and recover from failures of all types.
      Exercises
      ‣ Trigger and expose “latent defects”
      ‣ Choose when to discover them, instead of letting that be determined by the next real disaster.
    • start small... http://www.flickr.com/photos/oakleyoriginals/5674150237
    • increase awareness http://www.flickr.com/photos/maunzy/5099921731
    • build confidence http://www.flickr.com/photos/skevbo/4864249944
    • full scale, live fire exercises http://tacomafiredepartment.blogspot.com/2010/05/west-slope-training-burn.html
    • safety standards & “building codes” http://www.flickr.com/photos/peregrinari/3801964067
    • no substitutes for experience... Failure free operations require experience with failure. Photo © Ana Grillo Photography
    • Lessons Learned: Every Post-mortem, Ever...
    • Root Cause: “a perfect storm of impossible events”
    • Lesson #1: “we have a bunch of manual processes which we need to automate”
    • (thanks to Hyperbole & A Half)
    • Infrastructure as Code: Enable the reconstruction of the business from nothing but a source code repository, an application data backup, and bare resources.
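
One way to picture the "Infrastructure as Code" slide is a desired state kept as data in the repository plus a routine that converges a bare machine toward it. The sketch below is illustrative only: it is plain Python working on an in-memory stand-in, not Chef's or any other tool's API, and `Node`, `DESIRED`, and `converge` are hypothetical names with made-up example values.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a bare machine: installed packages and files."""
    packages: set = field(default_factory=set)
    files: dict = field(default_factory=dict)


# Desired state kept as data in the source repository (example values only).
DESIRED = {
    "packages": ["varnish", "memcached"],
    "files": {"/etc/motd": "managed by code\n"},
}


def converge(node: Node, desired: dict) -> list:
    """Bring the node to the desired state and return the changes made."""
    changes = []
    for package in desired["packages"]:
        if package not in node.packages:
            node.packages.add(package)
            changes.append(f"install {package}")
    for path, content in desired["files"].items():
        if node.files.get(path) != content:
            node.files[path] = content
            changes.append(f"write {path}")
    return changes


if __name__ == "__main__":
    bare = Node()
    print(converge(bare, DESIRED))  # first run makes every change
    print(converge(bare, DESIRED))  # second run finds nothing left to do
```

Because the routine only acts on differences, running it again is harmless, which is what makes rebuilding from a repository and bare resources repeatable.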
    • Lesson #1.5: “Golden Images don’t work in dynamic (i.e., Cloud) Environments”
    • Golden Images are not the answer http://www.flickr.com/photos/garysoup/2977173063/
      ‣ Gold is heavy & expensive
      ‣ Hard to transport
      ‣ Hard to mold
      ‣ Easy to lose configuration detail
    • When this: Varnish, JBoss App, Memcache, Postgres Slaves, Postgres Master
    • Becomes this... Varnish, JBoss App, Memcache, Postgres Slaves, Postgres Master
    • Change is the only Constant (Varnish, JBoss App, Memcache, Postgres Slaves, Postgres Master): 12+ resource changes for 1 node addition
      ‣ Load balancer config
      ‣ Nagios host ping
      ‣ Nagios host ssh
      ‣ Nagios host HTTP
      ‣ Nagios host app health
      ‣ Graphite CPU
      ‣ Graphite Memory
      ‣ Graphite Disk
      ‣ Graphite SNMP
      ‣ Memcache firewall
      ‣ Postgres firewall
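
The fan-out on the slide above is easier to manage when the dependent resources are derived from a node inventory rather than edited by hand. The following is a rough sketch of that idea, not any particular tool's behavior: the function names and output strings are invented for illustration, and the eleven resource types are the ones that can be read off the slide, which itself says "12+".

```python
# Resource types as read off the slide; the exact set is an assumption.
NAGIOS_CHECKS = ("ping", "ssh", "HTTP", "app health")
GRAPHITE_METRICS = ("CPU", "Memory", "Disk", "SNMP")
FIREWALLS = ("Memcache", "Postgres")


def dependent_resources(node: str) -> list:
    """List every resource that must change when one app node is added."""
    resources = [f"load balancer config: {node}"]
    resources += [f"Nagios host {check}: {node}" for check in NAGIOS_CHECKS]
    resources += [f"Graphite {metric}: {node}" for metric in GRAPHITE_METRICS]
    resources += [f"{service} firewall: allow {node}" for service in FIREWALLS]
    return resources


def plan(inventory: list) -> list:
    """Regenerate all dependent resources from the node inventory."""
    return [res for node in inventory for res in dependent_resources(node)]


if __name__ == "__main__":
    for line in dependent_resources("app2"):
        print(line)
    print(len(plan(["app1", "app2", "app3"])), "resources for 3 nodes")
```

Adding a node then means adding one inventory entry and regenerating, instead of remembering a dozen separate edits.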
    • Scale & Complexity increase! AZ2 DC1 AZ3
    • CLONING CANNOT COPE WITH THIS http://www.flickr.com/photos/evelynishere/2798236471/
    • (thanks to Hyperbole & A Half)
    • Lesson #2: “We need better Incident Management”
    • http://www.flickr.com/photos/mrwilleeumm/1102333740/sizes/o/in/photostream/
    • ♥ Dev+Ops Culture. Slide Courtesy of John Allspaw - http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr
    • make it fun!
    • Lesson #3: “(load balancing|website|DNS|database|deployment|storage|provisioning|cloud|etc) failover didn’t work. We need to test & maintain our emergency tools & processes”
    • use it or lose it.
    • Give yourself lots of knobs & levers & use them as part of regular process.
    • MTTR > MTBF
    • I ❤ MTTR
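
The emphasis on MTTR can be illustrated with the standard steady-state availability formula, MTBF / (MTBF + MTTR). The slides do not state this formula, so treat the sketch below as a supporting assumption with made-up numbers rather than part of the talk.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    baseline = availability(mtbf_hours=1000, mttr_hours=1.0)
    doubled_mtbf = availability(mtbf_hours=2000, mttr_hours=1.0)
    halved_mttr = availability(mtbf_hours=1000, mttr_hours=0.5)
    print(f"baseline:     {baseline:.6f}")
    print(f"double MTBF:  {doubled_mtbf:.6f}")
    print(f"halve MTTR:   {halved_mttr:.6f}")
```

Under this formula, halving recovery time buys the same availability as doubling the time between failures, and recovery is the part a team can rehearse in GameDay exercises.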
    • (thanks to Hyperbole & A Half)
    • GameDay. Slide Courtesy of John Allspaw - http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr http://www.flickr.com/photos/dnorman/2678090600
    • Thank You!!! Jesse Robbins, Cofounder, Opscode, @jesserobbins, jesse@opscode.com