Rebooting a Cloud

Video of this talk is here:
http://www.opscode.com/blog/2012/02/14/automate-all-the-things/

Published in: Technology, Business
Transcript of "Rebooting a Cloud"

  1. “Rebooting a Cloud”. Jesse Robbins, Cofounder, Opscode (@jesserobbins). http://www.flickr.com/photos/mwichary/2355832413/
  2. Jesse Robbins, Cofounder, Opscode. @jesserobbins, jesse@opscode.com
  3. “Rebooting a Cloud” http://www.flickr.com/photos/mwichary/2355832413/
  4. Opscode is Hiring!
  5. Press button, Receive candy! (who here wants candy?) (there is always someone.) http://www.flickr.com/photos/mwichary/2355832413/
  6. “Rebooting a Cloud” http://www.flickr.com/photos/mwichary/2355832413/
  7. Complexity Increases, Latent Defects Accumulate, & FAILURE HAPPENS!
  8. define: The Nines (roughly, as downtime per year): 99% = 5256 min (3.65 days); 99.9% = 526 min (8.8 hours); 99.99% = 53 min; 99.999% = 5 min; 99.9999% = 30 seconds; 99.99999% = 3 seconds
  9. 99.9% * 99.9% * 99.9% = 99.7% (oops!)
  10. Everyone goes through the same thing...
  11. Denial: This isn’t happening.
  12. ANGER: “HOW THE $^@* COULD THIS BE HAPPENING!!!”
  13. (more anger) “#@(* #$#$ %@%!!#$@ #*$#@$ @$*&#($*&@#(*$(”
  14. Bargaining: If we do this one thing, it will stop happening.
  15. Depression: What’s the point if this is just going to happen?
  16. Acceptance: “Failure Happens. We’ll just have to design for it.”
  17. Successful companies say: “Failure Happens”, “Embrace Failure”, “Design For Failure”, “Healthy attitude about Failure”, “Resilient (to Failure)”. THE OUTAGES WILL CONTINUE UNTIL THE APPROACH IMPROVES ;-)
  18. GameDay. Slide Courtesy of John Allspaw - http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr http://www.flickr.com/photos/dnorman/2678090600
  19. define: GameDay. An exercise designed to increase Resilience through large-scale fault injection across critical systems. Part of a larger discipline called Resilience Engineering. Not new, just new to us ;-)
  20. Resilience is a product of People & Technology
  21. GameDay increases Resilience in 3 ways
      Preparation
      ‣ Identification and mitigation of risks and impact from failure
      ‣ Reduces frequency of failure (MTBF)
      ‣ Reduces duration of recovery (MTTR)
      Participation
      ‣ Builds confidence & competence responding to failure and under stress.
      ‣ Strengthens individual and cultural ability to anticipate, mitigate, respond to, and recover from failures of all types.
      Exercises
      ‣ Trigger and expose “latent defects”
      ‣ Choose when to discover them, instead of letting that be determined by the next real disaster.
  22. start small... http://www.flickr.com/photos/oakleyoriginals/5674150237
  23. increase awareness http://www.flickr.com/photos/maunzy/5099921731
  24. build confidence http://www.flickr.com/photos/skevbo/4864249944
  25. full scale, live fire exercises http://tacomafiredepartment.blogspot.com/2010/05/west-slope-training-burn.html
  26. safety standards & “building codes” http://www.flickr.com/photos/peregrinari/3801964067
  27. no substitutes for experience... Failure free operations require experience with failure. Ana Grillo © Ana Grillo Photography
  28. Lessons Learned: Every Post-mortem, Ever...
  29. Root Cause: “a perfect storm of impossible events”
  30. Lesson #1: “we have a bunch of manual processes which we need to automate”
  31. (thanks to Hyperbole & A Half)
  32. Infrastructure as Code: Enable the reconstruction of the business from nothing but a source code repository, an application data backup, and bare resources.
  33. Lesson #1.5: “Golden Images don’t work in dynamic (i.e., Cloud) Environments”
  34. Golden Images are not the answer http://www.flickr.com/photos/garysoup/2977173063/
      • Gold is heavy & expensive
      • Hard to transport
      • Hard to mold
      • Easy to lose configuration detail
  35. When this: Varnish, Jboss App, Memcache, Postgres Slaves, Postgres Master
  36. Becomes this... Varnish, Jboss App, Memcache, Postgres Slaves, Postgres Master
  37. Change is the only Constant. Stack: Varnish, Jboss App, Memcache, Postgres Slaves, Postgres Master. 12+ resource changes for 1 node addition:
      • Load balancer config
      • Nagios host ping
      • Nagios host ssh
      • Nagios host HTTP
      • Nagios host app health
      • Graphite CPU
      • Graphite Memory
      • Graphite Disk
      • Graphite SNMP
      • Memcache firewall
      • Postgres firewall
  38. Scale & Complexity increase! AZ2, DC1, AZ3
  39. CLONING CANNOT COPE WITH THIS http://www.flickr.com/photos/evelynishere/2798236471/
  40. (thanks to Hyperbole & A Half)
  41. Lesson #2: “We need better Incident Management”
  42. http://www.flickr.com/photos/mrwilleeumm/1102333740/sizes/o/in/photostream/
  43. ♥ Dev+Ops Culture. Slide Courtesy of John Allspaw - http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr
  44. make it fun!
  45. Lesson #3: “(load balancing|website|DNS|database|deployment|storage|provisioning|cloud|etc) failover didn’t work. We need to test & maintain our emergency tools & processes”
  46. use it or lose it.
  47. Give yourself lots of knobs & levers & use them as part of regular process.
  48. MTTR > MTBF
  49. I ❤ MTTR
  50. (thanks to Hyperbole & A Half)
  51. GameDay. Slide Courtesy of John Allspaw - http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr http://www.flickr.com/photos/dnorman/2678090600
  52. Thank You!!! Jesse Robbins, Cofounder, Opscode. @jesserobbins, jesse@opscode.com
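
Worked example for the "Nines" slides: the downtime figures and the 99.9% * 99.9% * 99.9% product can be reproduced with a few lines of arithmetic. This is a minimal sketch, not part of the talk. It assumes a 365-day year and that the three tiers fail independently and must all be up at once; the function names are made up for illustration.

```python
# Rough availability arithmetic, assuming a 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year allowed at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def serial_availability(*availability_percents: float) -> float:
    """Combined availability (percent) when every component must be up.

    Assumes components fail independently of each other.
    """
    combined = 1.0
    for percent in availability_percents:
        combined *= percent / 100
    return combined * 100


if __name__ == "__main__":
    for nines in (99, 99.9, 99.99, 99.999):
        print(f"{nines}% -> {downtime_minutes_per_year(nines):.1f} min/year")
    combined = serial_availability(99.9, 99.9, 99.9)
    print(f"three 99.9% tiers in series -> {combined:.2f}%")
```

Under these assumptions, three "three nines" tiers chained together allow roughly three times the downtime of any single tier, which is the "oops!" on the slide.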

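
Worked example for the "MTTR > MTBF" slide: one way to read it is through the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The sketch below is not from the talk, and the baseline numbers (one failure every 720 hours, 4 hours to recover) are hypothetical. It shows that halving recovery time yields the same availability as doubling the time between failures.

```python
import math


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: a failure every 720 hours, 4 hours to recover.
BASE_MTBF_HOURS = 720.0
BASE_MTTR_HOURS = 4.0

baseline = availability(BASE_MTBF_HOURS, BASE_MTTR_HOURS)
halved_mttr = availability(BASE_MTBF_HOURS, BASE_MTTR_HOURS / 2)
doubled_mtbf = availability(BASE_MTBF_HOURS * 2, BASE_MTTR_HOURS)

if __name__ == "__main__":
    print(f"baseline:     {baseline:.5f}")
    print(f"halved MTTR:  {halved_mttr:.5f}")
    print(f"doubled MTBF: {doubled_mtbf:.5f}")
    print("same result:", math.isclose(halved_mttr, doubled_mtbf))
```

Since both changes buy the same availability, the choice comes down to which is more attainable. The deck's emphasis on GameDay exercises and regularly used "knobs & levers" suggests that practiced recovery is the lever it favors.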