Rebooting a Cloud

Video of this talk is here:
http://www.opscode.com/blog/2012/02/14/automate-all-the-things/


  1. 1. “Rebooting a Cloud” Jesse Robbins Cofounder, Opscode @jesserobbins http://www.flickr.com/photos/mwichary/2355832413/
  2. 2. Jesse Robbins Cofounder, Opscode @jesserobbins jesse@opscode.com
  3. 3. “Rebooting a Cloud” http://www.flickr.com/photos/mwichary/2355832413/
  4. 4. Opscode is Hiring!
  5. 5. Press button, Receive candy! (who here wants candy?) (there is always someone.) http://www.flickr.com/photos/mwichary/2355832413/
  6. 6. “Rebooting a Cloud” http://www.flickr.com/photos/mwichary/2355832413/
  7. 7. Complexity Increases, Latent Defects Accumulate, & FAILURE HAPPENS!
  8. 8. define: The Nines (roughly, downtime per year) 99% = 5256 min (3.65 days); 99.9% = 526 min (8.8 hours); 99.99% = 53 min; 99.999% = 5 min; 99.9999% = 30 seconds; 99.99999% = 3 seconds
  9. 9. 99.9% * 99.9% * 99.9% = 99.7% (oops!)
  10. 10. Everyone goes through the same thing...
  11. 11. Denial This isn’t happening.
  12. 12. ANGER “HOW THE $^@* COULD THIS BE HAPPENING!!!”
  13. 13. (more anger) “#@(* #$#$ %@%!!#$@ #* $#@$ @$*&#($*&@#(*$(”
  14. 14. Bargaining If we do this one thing, it will stop happening.
  15. 15. Depression What’s the point if this is just going to happen?
  16. 16. Acceptance “Failure Happens. We’ll just have to design for it.”
  17. 17. Successful companies say: “Failure Happens” “Embrace Failure” “Design For Failure” “Healthy attitude about Failure” “Resilient (to Failure)” THE OUTAGES WILL CONTINUE UNTIL THE APPROACH IMPROVES ;-)
  18. 18. GameDay Slide Courtesy of John Allspaw - http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr http://www.flickr.com/photos/dnorman/2678090600
  19. 19. define: GameDay An exercise designed to increase Resilience through large-scale fault injection across critical systems. Part of a larger discipline called Resilience Engineering. Not new, just new to us ;-)
  20. 20. Resilience is a product of People & Technology
  21. 21. GameDay increases Resilience in 3 ways
      Preparation
      ‣ Identification and mitigation of risks and impact from failure
      ‣ Reduces frequency of failure (higher MTBF)
      ‣ Reduces duration of recovery (lower MTTR)
      Participation
      ‣ Builds confidence & competence responding to failure and under stress.
      ‣ Strengthens individual and cultural ability to anticipate, mitigate, respond to, and recover from failures of all types.
      Exercises
      ‣ Trigger and expose “latent defects”
      ‣ Choose when to discover them, instead of letting that be determined by the next real disaster.
  22. 22. start small... http://www.flickr.com/photos/oakleyoriginals/5674150237
  23. 23. increase awareness http://www.flickr.com/photos/maunzy/5099921731
  24. 24. build confidence http://www.flickr.com/photos/skevbo/4864249944
  25. 25. full scale, live fire exercises http://tacomafiredepartment.blogspot.com/2010/05/west-slope-training-burn.html
  26. 26. safety standards & “building codes” http://www.flickr.com/photos/peregrinari/3801964067
  27. 27. no substitutes for experience... Failure free operations require experience with failure. Ana Grillo © Ana Grillo Photography
  28. 28. Lessons Learned: Every Post-mortem, Ever...
  29. 29. Root Cause: “a perfect storm of impossible events”
  30. 30. Lesson #1 “we have a bunch of manual processes which we need to automate”
  31. 31. (thanks to Hyperbole & A Half)
  32. 32. Infrastructure as Code: Enable the reconstruction of the business from nothing but a source code repository, an application data backup, and bare resources.
  33. 33. Lesson #1.5 “Golden Images don’t work in dynamic (i.e., Cloud) Environments”
  34. 34. Golden Images are not the answer • Gold is heavy & expensive • Hard to transport • Hard to mold • Easy to lose configuration detail http://www.flickr.com/photos/garysoup/2977173063/
  35. 35. When this Varnish Jboss App Memcache Postgres Slaves Postgres Master
  36. 36. Becomes this... Varnish Jboss App Memcache Postgres Slaves Postgres Master
  37. 37. Change is the only Constant
      Diagram: Varnish, Jboss App, Memcache, Postgres Slaves, Postgres Master
      • Load balancer config
      • Nagios host ping
      • Nagios host ssh
      • Nagios host HTTP
      • Nagios host app health
      • Graphite CPU
      • Graphite Memory
      • Graphite Disk
      • Graphite SNMP
      • Memcache firewall
      • Postgres firewall
      12+ resource changes for 1 node addition
  38. 38. Scale & Complexity increase! AZ2 DC1 AZ3
  39. 39. CLONING CANNOT COPE WITH THIS http://www.flickr.com/photos/evelynishere/2798236471/
  40. 40. (thanks to Hyperbole & A Half)
  41. 41. Lesson #2 “We need better Incident Management”
  42. 42. http://www.flickr.com/photos/mrwilleeumm/1102333740/sizes/o/in/photostream/
  43. 43. ♥ Dev+Ops Culture Slide Courtesy of John Allspaw - http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr
  44. 44. make it fun!
  45. 45. Lesson #3 “(load balancing|website|DNS|database|deployment|storage|provisioning|cloud|etc) failover didn’t work. We need to test & maintain our emergency tools & processes”
  46. 46. use it or lose it.
  47. 47. Give yourself lots of knobs & levers & use them as part of regular process.
  48. 48. MTTR > MTBF
  49. 49. I ❤ MTTR
  50. 50. (thanks to Hyperbole & A Half)
  51. 51. GameDay Slide Courtesy of John Allspaw - http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr http://www.flickr.com/photos/dnorman/2678090600
  52. 52. Thank You!!! Jesse Robbins Cofounder, Opscode @jesserobbins jesse@opscode.com
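
The availability arithmetic on slides 8, 9, and 48 can be checked with a short Python sketch. It is an illustration and not part of the talk: the function names are hypothetical, a year is assumed to be 365 days, components are assumed to fail independently and to all be required (a serial chain), and the steady-state formula MTBF / (MTBF + MTTR) is a standard textbook model that the slides do not state.

```python
# Minimal sketch of the availability arithmetic behind slides 8, 9, and 48.
# Assumptions (not from the slides): 365-day year, independent components
# in series, and the textbook steady-state availability formula.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime budget, in minutes, for a given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time a system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Slide 8: downtime budgets for "the nines".
    for nines in (0.99, 0.999, 0.9999, 0.99999):
        minutes = downtime_minutes_per_year(nines)
        print(f"{nines:.3%} available -> {minutes:.1f} min/year of downtime")

    # Slide 9: three 99.9% components in series.
    chained = serial_availability(0.999, 0.999, 0.999)
    print(f"three 99.9% components in series: {chained:.2%}")

    # Slide 48: compare improving recovery time with improving failure rate.
    baseline = steady_state_availability(mtbf_hours=1000, mttr_hours=1.0)
    faster_recovery = steady_state_availability(mtbf_hours=1000, mttr_hours=0.5)
    rarer_failure = steady_state_availability(mtbf_hours=2000, mttr_hours=1.0)
    print(f"baseline: {baseline:.4%}")
    print(f"half the MTTR: {faster_recovery:.4%}")
    print(f"double the MTBF: {rarer_failure:.4%}")
```

In this simple model, halving MTTR and doubling MTBF yield the same availability, so the model alone does not favor either one. The talk's preference for MTTR rests on its broader argument that failure will happen regardless and that recovery is something a team can practice.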