DevOpsDays Austin: Helping Horses Become Unicorns, Chef's Operations Maturity Model

Helping customers evaluate their ability to deploy and operate systems while managing incidents is key to our Consulting practice. We have developed an operations maturity model that provides a roadmap for understanding and improving mean time to production while setting realistic expectations. This session will explain the challenges and thresholds for becoming a more effective organization.

Published in: Software, Technology, Business

  1. Chef’s Operations Maturity Model: Helping Horses Become Unicorns Matt Ray DevOpsDays Austin May 5, 2014
  2. Introductions • Matt Ray • Director Partner Integration at Chef • matt@getchef.com • mattray: GitHub | IRC | Twitter
  3. “If there’s anything that all horses hate, it’s hearing stories about unicorns.” Chris Little
  4. http://pichost.me/1468004/ DevOps Unicorns • Etsy • Facebook • Netflix
  5. https://keepinghouseandhorse.files.wordpress.com/2013/10/photoshop3.jpeg But… Enterprise • Our applications are too complex • Politics get in the way • We’ve always done it this way
  6. It’s Not Magic • Not everyone requires Continuous Delivery • They require: • Higher reliability • Greater visibility • More resilience • Faster response
  7. https://img0.etsystatic.com/000/0/5209298/il_fullxfull.282855902.jpg How Do We Get There?
  8. The Map is not the Territory • Comparative study of Operational Maturity Models • On one end: ad-hoc, slow to respond, “traditional” approach • At the other: very fast, fully automated, and disaster indifferent • Figure out what is most important to your Organization https://www.chimacumtack.com/images/measurehorse.jpg
  9. Fitting the Model • Varying degrees of adoption • Operational trends often correlated and relational, but not definitive • Roadmap for improving time to deployment and lowering time to recovery • Understand the challenges, set real expectations for progress http://www.web3dservice.com/3d_models/images/unicorn_3d_model_03.jpg
  10. Roadmap Considerations • Hardware Management • OS Management • Infrastructure Management • Software Deployments • Incident Management • Disaster Recovery http://cultofunicorn.com/wp-content/uploads/2013/05/Unicorn_horse.jpg
  11. Hardware Management
  12. Every Server is Sacred! • HA Support expected across the entire stack • Dependence on vendor/on-site SE for replacement/maintenance • “This is the best hardware money can buy!” • Architecture Review and Request Forms for all changes • “Tier 1” data centers • Every project special snowflake
  13. 1 SysAdmin to 25-250 systems? Automate Common Tasks
  14. Maybe not ALL servers are sacred… • Start using some farms of standardized machines • Fewer support contracts, less dependence on vendor/on-site support • Architecture Reviews for new services with some implementation standardization • HA support across most of the stack • Probably still using “Tier 1” data centers with excess redundancy
  15. 1 Systems Engineer to 250-500 systems Configuration Management
  16. Most of these servers aren’t sacred? • Limited support on ALL systems • On-site support used sparingly, lower-skill onsite staff for “normal” failures • Architecture Reviews only manage exceptions. Automated requests may be exposed via emerging APIs • Wide adoption of virtualization: server instances are commoditized • Hardware becoming standardized and easy to replace • Smaller, more efficient data centers • Limited redundancy with hot/hot/hot N+1/N HA strategies
  17. Application Management 1 Systems Engineer to 500-1000 Systems
  18. None of the servers are sacred • Infrastructure as a Service • Hardware (if any) is fully commoditized • Hardware is completely standardized, special cases are regarded as a risk to business • Redundant Array of Inexpensive Data centers
  19. 1 Site Reliability Engineer to 1000+ Systems Continuous Delivery
  20. 1 Site Reliability Engineer to 1000+ Systems Continuous Delivery
  21. Operating System Management
  22. Operating Systems Management • Many OS flavors and versions. Manual, irregular patching • Limited flavors and versions, planned upgrades. “Patch Tuesday!” • Standard versions using JEOS with regular upgrades. Automated patching • Internally maintained versions, constant upgrades
  23. http://www.smallwebs.com/Swords/images/UK1796HC2d/SCOTLANDFOREVER2.jpg Incident Management
  24. Incident Threshold: Recovery Time • Which teams have regular on-call responsibilities? • What is expected of someone on call? • How are people notified & engaged on an incident?
  25. Incident Threshold: Recovery Time • “Something is wrong!” 12+ hours • “Something is wrong with the…!” 1-12 hours • “Something went wrong with your deployment!” <60 minutes • “The core infrastructure fabric is down!” seconds to 10 minutes
  26. Postmortems http://photography.nationalgeographic.com/photography/photo-of-the-day/
  27. Postmortems • Postmortem Focus • Root Cause Orientation • Root Cause Mitigation/Resolution • Root Cause Elimination Rate http://img3.wikia.nocookie.net/__cb20111008164412/mlpfanart/images/thumb/b/b2/Twilight_Sparkle_Angry_by_Ivan-Chan.png/597px-Twilight_Sparkle_Angry_by_Ivan-Chan.png
  28. Postmortems: Ad Hoc • “Human Error”: blame finding & punishment • “Triggering Event”: blaming specific operator error or specific hardware failures • Cycle between protecting heroes and then firing them • <10% - Mostly break/fix detection
  29. Postmortems: Formal • Focus on “Triggering Event” or “Human Error”, but blaming process and/or infrastructure • “Let's implement more process and overhead” • 10% within 3 months - mostly simple fixes • Tracking but little progress against goals vs. other priorities, frequent recurrence
  30. Postmortems: Officially “Blame Free” • Primary focus on underlying technical root causes, systemic fixes • Improved tooling, programmatic checks, operator tools for special cases. Some focus on building resiliency • 20% - Easily fixable issues eliminated within 3 months, programs to eliminate larger issues over time
  31. Postmortems: “5 Whys” • Including business and cultural issues • Primary focus on insights and opportunities from lessons learned • Increased resiliency and appropriate operator tools, focus on self-healing fixes • Recurrence becomes infrequent and is a big deal
  32. Navigating the Change • Many more mile markers • Roadmap to improve your • Mean Time to Production • Mean Time to Recovery
  33. Becoming a Unicorn is Possible • Approach the challenges with realistic expectations for your organization • Always room for improvement • Culture trumps everything http://webecoist.momtastic.com/wp-content/uploads/2010/09/unicorns_3x.jpg
  34. Where Can I Download It? bit.ly/Chef-OMM
  35. Thanks! Matt Ray matt@getchef.com @mattray Thanks to George Miranda, Paul Edelhertz & Jesse Robbins
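
The staffing ratios in the Hardware Management slides (1 engineer to 25-250, 250-500, 500-1000, and 1000+ systems) can be read as a rough lookup table. The sketch below is one way to express that, not part of the talk: the function name `stage_for_ratio` is hypothetical, and because the slide ranges share their endpoints (250 appears in two ranges), it assumes a boundary value belongs to the higher range.

```python
# Labels and lower bounds come from the slides; treating each lower bound
# as inclusive is an assumption, since the slide ranges overlap at 250, 500, 1000.
STAGES = [
    (25, "Automate Common Tasks"),
    (250, "Configuration Management"),
    (500, "Application Management"),
    (1000, "Continuous Delivery"),
]


def stage_for_ratio(systems, engineers):
    """Return the slide label paired with a systems-per-engineer ratio.

    Returns None when the ratio is below 25, which the slides do not cover.
    """
    if engineers <= 0:
        raise ValueError("engineers must be positive")
    ratio = systems / engineers
    label = None
    for lower_bound, name in STAGES:
        # Keep the last stage whose lower bound the ratio reaches.
        if ratio >= lower_bound:
            label = name
    return label


print(stage_for_ratio(1200, 4))  # ratio of 300 systems per engineer
```

As the talk itself cautions, the map is not the territory: a ratio like this is a conversation starter about where an organization sits, not a score.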

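Slide 32 names Mean Time to Recovery as one of the two measures the roadmap aims to improve. A minimal sketch of how that number might be computed from incident records follows; it assumes each incident is recorded as a (detected, resolved) pair of timestamps, and both the function name and the sample data are hypothetical rather than taken from the talk.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average of (resolved - detected) over (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so sum() adds timedeltas, not ints
    )
    return total / len(incidents)


# Hypothetical sample data: one 30-minute and one 90-minute incident.
incidents = [
    (datetime(2014, 5, 1, 9, 0), datetime(2014, 5, 1, 9, 30)),
    (datetime(2014, 5, 2, 14, 0), datetime(2014, 5, 2, 15, 30)),
]
mttr = mean_time_to_recovery(incidents)
print(mttr)
```

Tracking this value over time, rather than reading it once, is what lets it be compared against the recovery-time thresholds on slide 25.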