Rebooting a Cloud


Video of this talk is here:
http://www.opscode.com/blog/2012/02/14/automate-all-the-things/

Rebooting a Cloud Presentation Transcript

  • 1. “Rebooting a Cloud” Jesse Robbins Cofounder, Opscode @jesserobbins http://www.flickr.com/photos/mwichary/2355832413/
  • 2. Jesse Robbins, Cofounder, Opscode, @jesserobbins, jesse@opscode.com
  • 3. “Rebooting a Cloud” http://www.flickr.com/photos/mwichary/2355832413/
  • 4. Opscode is Hiring!
  • 5. Press button, Receive candy! (who here wants candy?) (there is always someone.) http://www.flickr.com/photos/mwichary/2355832413/
  • 6. “Rebooting a Cloud” http://www.flickr.com/photos/mwichary/2355832413/
  • 7. Complexity Increases, Latent Defects Accumulate, & FAILURE HAPPENS!
  • 8. define: The Nines (roughly), as downtime per year: 99% = 5256 min (3.5 days); 99.9% = 528 min (8.8 hours); 99.99% = 53 min; 99.999% = 5 min; 99.9999% = 30 seconds; 99.99999% = 3 seconds
  • 9. 99.9% * 99.9% * 99.9% = 99.7% (oops!)
  • 10. Everyone goes through the same thing...
  • 11. Denial: This isn’t happening.
  • 12. ANGER: “HOW THE $^@* COULD THIS BE HAPPENING!!!”
  • 13. (more anger) “#@(* #$#$ %@%!!#$@ #*$#@$ @$*&#($*&@#(*$(”
  • 14. Bargaining: If we do this one thing, it will stop happening.
  • 15. Depression: What’s the point if this is just going to happen?
  • 16. Acceptance: “Failure Happens. We’ll just have to design for it.”
  • 17. Successful companies say: “Failure Happens” “Embrace Failure” “Design For Failure” “Healthy attitude about Failure” “Resilient (to Failure)” THE OUTAGES WILL CONTINUE UNTIL THE APPROACH IMPROVES ;-)
  • 18. GameDay. Slide Courtesy of John Allspaw - http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr http://www.flickr.com/photos/dnorman/2678090600
  • 19. define: GameDay An exercise designed to increase Resilience through large-scale fault injection across critical systems. Part of a larger discipline called Resilience Engineering. Not new, just new to us ;-)
  • 20. Resilience is a product of People & Technology
  • 21. GameDay increases Resilience in 3 ways. Preparation: ‣ Identification and mitigation of risks and impact from failure ‣ Reduces frequency of failure (MTBF) ‣ Reduces duration of recovery (MTTR). Participation: ‣ Builds confidence & competence responding to failure and under stress ‣ Strengthens individual and cultural ability to anticipate, mitigate, respond to, and recover from failures of all types. Exercises: ‣ Trigger and expose “latent defects” ‣ Choose when to discover them, instead of letting that be determined by the next real disaster.
  • 22. start small... http://www.flickr.com/photos/oakleyoriginals/5674150237
  • 23. increase awareness http://www.flickr.com/photos/maunzy/5099921731
  • 24. build confidence http://www.flickr.com/photos/skevbo/4864249944
  • 25. full scale, live fire exercises http://tacomafiredepartment.blogspot.com/2010/05/west-slope-training-burn.html
  • 26. safety standards & “building codes” http://www.flickr.com/photos/peregrinari/3801964067
  • 27. no substitutes for experience... Failure free operations require experience with failure. Photo © Ana Grillo Photography
  • 28. Lessons Learned: Every Post-mortem, Ever...
  • 29. Root Cause: “a perfect storm of impossible events”
  • 30. Lesson #1: “we have a bunch of manual processes which we need to automate”
  • 31. (thanks to Hyperbole & A Half)
  • 32. Infrastructure as Code: Enable the reconstruction of the business from nothing but a source code repository, an application data backup, and bare resources.
  • 33. Lesson #1.5: “Golden Images don’t work in dynamic (i.e., Cloud) Environments”
  • 34. Golden Images are not the answer: • Gold is heavy & expensive • Hard to transport • Hard to mold • Easy to lose configuration detail http://www.flickr.com/photos/garysoup/2977173063/
  • 35. When this Varnish Jboss App Memcache Postgres Slaves Postgres Master
  • 36. Becomes this... Varnish Jboss App Memcache Postgres Slaves Postgres Master
  • 37. Change is only Constant. Diagram: Varnish, Jboss App, Memcache, Postgres Slaves, Postgres Master. 12+ resource changes for 1 node addition: • Load balancer config • Nagios host ping • Nagios host ssh • Nagios host HTTP • Nagios host app health • Graphite CPU • Graphite Memory • Graphite Disk • Graphite SNMP • Memcache firewall • Postgres firewall
  • 38. Scale & Complexity increase! AZ2 DC1 AZ3
  • 39. CLONING CANNOT COPE WITH THIS http://www.flickr.com/photos/evelynishere/2798236471/
  • 40. (thanks to Hyperbole & A Half)
  • 41. Lesson #2: “We need better Incident Management”
  • 42. http://www.flickr.com/photos/mrwilleeumm/1102333740/sizes/o/in/photostream/
  • 43. ♥ Dev+Ops Culture. Slide Courtesy of John Allspaw - http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr
  • 44. make it fun!
  • 45. Lesson #3: “(load balancing|website|DNS|database|deployment|storage|provisioning|cloud|etc) failover didn’t work. We need to test & maintain our emergency tools & processes”
  • 46. use it or lose it.
  • 47. Give yourself lots of knobs & levers & use them as part of regular process.
  • 48. MTTR > MTBF
  • 49. I ❤ MTTR
  • 50. (thanks to Hyperbole & A Half)
  • 51. GameDay. Slide Courtesy of John Allspaw - http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr http://www.flickr.com/photos/dnorman/2678090600
  • 52. Thank You!!! Jesse Robbins, Cofounder, Opscode, @jesserobbins, jesse@opscode.com
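
The availability arithmetic on slides 8, 9 and 48 can be checked with a few lines of Python. This is a minimal sketch, not something from the talk: the function names are made up for illustration, a year is assumed to be 365 days, components are assumed to fail independently and all be required, and the MTBF/(MTBF + MTTR) formula is the standard steady-state model, which the slides do not state. The slide figures are rounded, so the computed values differ slightly.

```python
# Illustrative helpers only; none of these names appear in the talk.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a 365-day year


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for an availability such as 0.999 (slide 8)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up (slide 9).

    Assumes the components fail independently of each other.
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Standard steady-state model: uptime / (uptime + repair time)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    for nines in (0.99, 0.999, 0.9999, 0.99999):
        minutes = downtime_minutes_per_year(nines)
        print(f"{nines:.5%} allows {minutes:.1f} min of downtime per year")

    chained = serial_availability(0.999, 0.999, 0.999)
    print(f"three 99.9% components in series: {chained:.4%}")

    # Hypothetical numbers: one failure per 1000 hours, one hour to recover.
    print(f"baseline:     {steady_state_availability(1000, 1.0):.5%}")
    print(f"halved MTTR:  {steady_state_availability(1000, 0.5):.5%}")
    print(f"doubled MTBF: {steady_state_availability(2000, 1.0):.5%}")
```

In this model availability depends only on the ratio MTTR/MTBF, so halving recovery time buys exactly as much as doubling the time between failures. That is one way to read the emphasis on MTTR in slide 48, on the assumption that recovery time is the easier of the two to improve through practice such as GameDay.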