
Clearing the Way For SRE In the Enterprise - Damon Edwards at SREcon EU 2018

Slides from Damon Edwards presentation at SREcon in Dusseldorf, Germany on 30 Aug 2018.

Video will be available here:
https://www.usenix.org/conference/srecon18europe/presentation/edwards


  1. 1. Clearing the Way For SRE in the Enterprise Damon Edwards @damonedwards
  2. 2. Community Ops Improvement DevOps Ops Tools Damon Edwards
  3. 3. Digital Agile DevOps CI/CD Cloud Docker Kubernetes Microservices CHANGE Wow That is cool I wish I could work there
  4. 4. Ops Business Idea Shorter Time-to-Market Fast Feedback from Users Dev Ops Running Services Improved Quality Digital and DevOps Availability Auditing Security Compliance “Go faster!” “Open up!” “Lock it down!” “Great for Dev, but what about Ops?”
  5. 5. Our transformation has largely ignored Ops. Any ideas? Have you heard of SRE? Google does it.
  6. 6. Jane Doe Systems Administrator
  7. 7. Jane Doe Systems Administrator We have SysAdmins
  8. 8. Jane Doe Systems Administrator They should be SREs!
  9. 9. Jane Doe SRE They should be SREs!
  10. 10. ITIL Book 1 ITIL Book 2 ITIL Book 3 ITIL Book 4 ITIL Book 5 Quality! is job #1 Sys Admin CAB CALENDAR Your new title is SRE. Now write code and be better at ops. PROVISIONING PROCESS Dilbert characters © Scott Adams www.dilbert.com
  11. 11. SysAdmins Overloaded. Constant firefighting. Waiting in ticket queues for everything. Things break. Break again. And again. Everyone is busy, but it doesn’t get any better. Our transformation has largely ignored Ops. Any ideas? Have you heard of SRE? Google does it. Everything takes too long, costs too much, and breaks too often! Executive View
  12. 12. SysAdmins Overloaded. Constant firefighting. Waiting in ticket queues for everything. Things break. Break again. And again. Everyone is busy, but it doesn’t get any better. Our transformation has largely ignored Ops. Any ideas? Have you heard of SRE? Google does it. Everything takes too long, costs too much, and breaks too often! Executive View (False) SRE Overloaded. Constant firefighting. Waiting in ticket queues for everything. Things break. Break again. And again. Everyone is busy, but it doesn’t get any better. Our transformation has largely ignored Ops. Any ideas? Have you heard of SRE? Google does it. Everything takes too long, costs too much, and breaks too often! Executive View
  13. 13. Changing job titles or adding individual skills doesn’t make systems administrators SREs.
  14. 14. Principles of SRE are what set SRE apart
  15. 15. Principles of SRE are what set SRE apart 1. SRE needs Service Level Objectives, with consequences
  16. 16. Principles of SRE are what set SRE apart 1. SRE needs Service Level Objectives, with consequences
  17. 17. SLO and Error Budgets: Tools for Shared Responsibility 0 100 Service Level Objective Error Budget* Service Level Indicator (*Use this to improve the service)
  18. 18. SLO and Error Budgets: Tools for Shared Responsibility 0 100 Service Level Objective Error Budget* Service Level Indicator (*Use this to improve the service) DEV BIZ Ops
  19. 19. Principles of SRE are what set SRE apart 1. SRE needs Service Level Objectives, with consequences
  20. 20. Principles of SRE are what set SRE apart 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today
  21. 21. Principles of SRE are what set SRE apart 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload
  22. 22. Principles of SRE are what set SRE apart Stephen Thorne At DevOps Enterprise Summit London 2018 “Principles of SRE” https://youtu.be/c-w_GYvi0eA 1. SRE needs Service Level Objectives, with consequences 2. SREs have time to make tomorrow better than today 3. SRE teams have the ability to regulate their workload
  23. 23. Forces That Undermine SRE Principles Silos Queues Excessive Toil Low Trust
  24. 24. Forces That Undermine SRE Principles Silos Queues Excessive Toil Low Trust
  25. 25. Silos Backlog Information Priorities Tools
  26. 26. Backlog Information I need X Priorities Tools Silos
  27. 27. Backlog Information I need X Priorities Tools Silos Backlog I do X Requests for X Silo A Information Priorities Silo B Tools
  28. 28. Silos cause disconnects and mismatches Backlog Information I need X Priorities Tools Backlog I do X Requests for X Silo A Information Priorities Silo B Tools Context Context Process Process Tooling Tooling Capacity Capacity
  29. 29. 1 2 3 Silos Interfere with feedback loops
  30. 30. 1 2 3 Silos Interfere with feedback loops Producer Consumer Ops Ops Ops
  31. 31. Function A Function B Function C Silos create labor pools of functional specialists Requests fulfilled by semi-manual or manual effort Primary management focus is on protecting team capacity
  32. 32. Silos Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload?
  33. 33. Silos Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Disjointed silos make meaningful SLOs and shared responsibility almost impossible X
  34. 34. Silos Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Disjointed silos make meaningful SLOs and shared responsibility almost impossible X Siloed labor pools, disconnected processes and tools, and slow feedback loops tend to consume all available capacity X
  35. 35. Silos Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Disjointed silos make meaningful SLOs and shared responsibility almost impossible X Siloed labor pools, disconnected processes and tools, and slow feedback loops tend to consume all available capacity X Struggling to keep up with demand and unable to protect capacity X
  36. 36. Forces That Undermine SRE Principles Silos Queues Toil Low Trust
  37. 37. How do we cover for our cross-silo disconnects and mismatches? Silo A Silo B
  38. 38. How do we cover for our cross-silo disconnects and mismatches? Silo A Silo B Ticket Queue
  39. 39. ?? Silo A Silo B We all know how well that works Ticket Queue
  40. 40. Request queues are an expensive way to manage work Ticket Queue Queues Create… Longer Cycle Time Increased Risk More Variability More Overhead Lower Quality Less Motivation Adapted from Donald G. Reinertsen, The Principles of Product Development Flow: Second Generation Lean Product Development
  41. 41. What do queues do to value streams?
  42. 42. What do queues do to value streams? Queue A Queue B
  43. 43. What do queues do to value streams? Queue A Queue B Queues disintegrate and obfuscate value streams
  44. 44. Ticket queues become “snowflake makers” ?? Silo A Silo B Ticket Queue
  45. 45. Ticket queues become “snowflake makers” ?? Silo A Silo B Ticket Queue Snowflakes (each unique, technically acceptable but unreproducible and brittle)
  46. 46. Ticket Queues Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload?
  47. 47. Ticket Queues Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Tickets reinforce siloed behaviors and obfuscate the value stream X
  48. 48. Ticket Queues Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Tickets reinforce siloed behaviors and obfuscate the value stream X Longer cycle time, more variability, more overhead, lower quality, and more snowflakes consume available capacity X
  49. 49. Ticket Queues Undermine SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Tickets reinforce siloed behaviors and obfuscate the value stream X Longer cycle time, more variability, more overhead, lower quality, and more snowflakes consume available capacity X Queues obfuscate the pressure being put on request fulfillers X
  50. 50. Forces That Undermine Operations Silos Queues Toil Low Trust
  51. 51. Toil is the enemy of SRE
  52. 52. Toil is the enemy of SRE “Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” -Vivek Rau Google
  53. 53. Toil vs. Engineering Work Toil Engineering Work Lacks Enduring Value Builds Enduring Value Rote, Repetitive Creative, Iterative Tactical Strategic Increases With Scale Enables Scaling Can Be Automated Requires Human Creativity
  54. 54. Excessive toil prevents fixing the system Toil Engineering Work E.W. Toil Reduce toil Improve the business No capacity to reduce toil No capacity to improve business Toil at manageable percentage of capacity Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”)
  55. 55. Excessive toil prevents fixing the system Toil Engineering Work E.W. Toil Reduce toil Improve the business No capacity to reduce toil No capacity to improve business Toil at manageable percentage of capacity Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”)
  56. 56. Excessive Toil Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload?
  57. 57. Excessive Toil Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Being buried in toil keeps the team from contributing the engineering work needed to uphold its end of the shared responsibility deal X
  58. 58. Excessive Toil Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Being buried in toil keeps the team from contributing the engineering work needed to uphold its end of the shared responsibility deal X Buried in toil… no capacity for engineering work to reduce toil. X
  59. 59. Excessive Toil Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Being buried in toil keeps the team from contributing the engineering work needed to uphold its end of the shared responsibility deal X Buried in toil… no capacity for engineering work to reduce toil. X Buried in toil… no capacity for engineering work to reduce toil. X
  60. 60. Forces That Undermine Operations Silos Queues Toil Low Trust
  61. 61. Where are decisions made? Who can take action? escalate 1° 2° 3° 4° escalate escalate or Decisions made here
  62. 62. All work is contextual John Allspaw
  63. 63. All work is contextual rm -rf $PATHNAME John Allspaw
  64. 64. All work is contextual rm -rf $PATHNAME Is this dangerous? John Allspaw
  65. 65. All work is contextual rm -rf $PATHNAME John Allspaw
  66. 66. All work is contextual rm -rf $PATHNAME John Allspaw
  67. 67. All work is contextual rm -rf $PATHNAME Is this dangerous? John Allspaw
  68. 68. All work is contextual rm -rf $PATHNAME John Allspaw
  69. 69. All work is contextual rm -rf $PATHNAME Answer is always “it depends” John Allspaw
  70. 70. escalate 1° 2° 3° 4° escalate escalate or Context Where are decisions made? Who can take action?
  71. 71. Low trust + approvals = illusion of control Ticket System
  72. 72. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and
  73. 73. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”)
  74. 74. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”)
  75. 75. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”) …subtract the too removed to judge (“mostly guessing”)
  76. 76. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”) …subtract the too removed to judge (“mostly guessing”)
  77. 77. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”) …subtract the too removed to judge (“mostly guessing”)
  78. 78. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”) …subtract the too removed to judge (“mostly guessing”) How many are you left with?
  79. 79. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”) …subtract the too removed to judge (“mostly guessing”) How many are you left with? How many were the right call?
  80. 80. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”) …subtract the too removed to judge (“mostly guessing”) How many are you left with? How many were the right call? How many got rejected?
  81. 81. Low trust + approvals = illusion of control Ticket System Add up the total number of approval requests and …subtract the info radiators (“I need to be in the loop”) …subtract the CYAs (“Prove you followed the process”) …subtract the too removed to judge (“mostly guessing”) How many are you left with? How many were the right call? How many got rejected?
  82. 82. Low Trust Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload?
  83. 83. Low Trust Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Cultures of low trust have a really difficult time with shared responsibility X
  84. 84. Low Trust Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Cultures of low trust have a really difficult time with shared responsibility X People closest to problems know what to fix but tasking, priorities, and decisions are largely out of their control X
  85. 85. Low Trust Undermines SRE Principles 1. Org has Service Level Objectives, with consequences? 2. SREs have time to make tomorrow better than today? 3. SRE teams have the ability to regulate their workload? Cultures of low trust have a really difficult time with shared responsibility X People closest to problems know what to fix but tasking, priorities, and decisions are largely out of their control X People aren’t trusted to plan or design their own work X
  86. 86. Forces That Undermine Operations Silos Queues Toil Low Trust
  87. 87. So what can we do differently?
  88. 88. Lean on Lean to find what to fix 1. Map the end-to-end flow of information and artifacts (using a recent delivery or event) 2. Identify what slows lead times, undermines quality, and impacts flow 3. Identify countermeasures and create improvement storyboards (justification/plan)
  89. 89. Lean on Lean to find what to fix 1. Map the end-to-end flow of information and artifacts (using a recent delivery or event) 2. Identify what slows lead times, undermines quality, and impacts flow 3. Identify countermeasures and create improvement storyboards (justification/plan) All processes should be studied with an improvement discipline
  90. 90. Lean on Lean to find what to fix 1. Map the end-to-end flow of information and artifacts (using a recent delivery or event) 2. Identify what slows lead times, undermines quality, and impacts flow 3. Identify countermeasures and create improvement storyboards (justification/plan) All processes should be studied with an improvement discipline Incidents are just as much a “process” as delivery
  91. 91. Lean on Lean to find what to fix 1. Map the end-to-end flow of information and artifacts (using a recent delivery or event) 2. Identify what slows lead times, undermines quality, and impacts flow 3. Identify countermeasures and create improvement storyboards (justification/plan) All processes should be studied with an improvement discipline Incidents are just as much a “process” as delivery Look to Lean for proven improvement techniques (value stream mapping, waste analysis, improvement kata)
  92. 92. Lean on Lean to find what to fix 1. Map the end-to-end flow of information and artifacts (using a recent delivery or event) 2. Identify what slows lead times, undermines quality, and impacts flow 3. Identify countermeasures and create improvement storyboards (justification/plan) All processes should be studied with an improvement discipline Incidents are just as much a “process” as delivery Look to Lean for proven improvement techniques (value stream mapping, waste analysis, improvement kata) Make it a part of your organization’s discipline
  93. 93. Get rid of as many silos as possible Old Silo A Old Silo B Old Silo C Old Silo D
  94. 94. Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Get rid of as many silos as possible
  95. 95. Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Get rid of as many silos as possible Key 1: get rid of as many handoffs as possible
  96. 96. Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Get rid of as many silos as possible Key 1: get rid of as many handoffs as possible Key 2: “Horizontal” shared responsibility, not everyone doing everything!
  97. 97. Shared responsibility matters more than org model Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Development Team 1 Development Team 2 Development Team n SRE Team Clear handoff requirements Error budget consequences “Netflix” Model “Google” Model
  98. 98. Shared responsibility matters more than org model Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Development Team 1 Development Team 2 Development Team n SRE Team Clear handoff requirements Error budget consequences “Netflix” Model “Google” Model
  99. 99. Shared responsibility matters more than org model Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Development Team 1 Development Team 2 Development Team n SRE Team Clear handoff requirements Error budget consequences “Netflix” Model “Google” Model Same high-quality, high-velocity results!
  100. 100. Why focus on getting rid of handoffs?
  101. 101. Why focus on getting rid of handoffs? 1. Your people are your most valuable assets
  102. 102. Why focus on getting rid of handoffs? 1. Your people are your most valuable assets 2. The SRE skillset is expensive
  103. 103. Why focus on getting rid of handoffs? 1. Your people are your most valuable assets 2. The SRE skillset is expensive 3. Stay out of their way!
  104. 104. SREs are expensive, stay out of their way! Ticket Queue ✅Ticket Queue Ticket Queue Ticket Queue Backlog Ticket Queue Ticket Queue ✅ Backlog Not this: This:
  105. 105. SREs are expensive, stay out of their way! Ticket Queue ✅Ticket Queue Ticket Queue Ticket Queue Backlog Ticket Queue Ticket Queue ✅ Backlog Not this: This: Observe Orient Decide Action SRE OODA Loop Reduce friction:
  106. 106. SREs are expensive, stay out of their way! Ticket Queue ✅Ticket Queue Ticket Queue Ticket Queue Backlog Ticket Queue Ticket Queue ✅ Backlog Not this: This: Invest in the right instrumentation Observe Orient Decide Action SRE OODA Loop Reduce friction:
  107. 107. SREs are expensive, stay out of their way! Ticket Queue ✅Ticket Queue Ticket Queue Ticket Queue Backlog Ticket Queue Ticket Queue ✅ Backlog Not this: This: Invest in the right instrumentation Invest in collaboration, checklists, investigatory tools Observe Orient Decide Action SRE OODA Loop Reduce friction:
  108. 108. SREs are expensive, stay out of their way! Ticket Queue ✅Ticket Queue Ticket Queue Ticket Queue Backlog Ticket Queue Ticket Queue ✅ Backlog Not this: This: Invest in the right instrumentation Invest in collaboration, checklists, investigatory tools Empower them to make decisions! Observe Orient Decide Action SRE OODA Loop Reduce friction:
  109. 109. SREs are expensive, stay out of their way! Ticket Queue ✅Ticket Queue Ticket Queue Ticket Queue Backlog Ticket Queue Ticket Queue ✅ Backlog Not this: This: Invest in the right instrumentation Invest in collaboration, checklists, investigatory tools Empower them to make decisions! Empower them to take action! Observe Orient Decide Action SRE OODA Loop Reduce friction:
  110. 110. What about the handoffs you can’t get rid of? Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Specialist Capabilities Specialist Capabilities Specialist Capabilities
  111. 111. What about the handoffs you can’t get rid of? Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Specialist Capabilities Specialist Capabilities Specialist Capabilities Ticket Queue Ticket Queue Ticket Queue
  112. 112. What about the handoffs you can’t get rid of? Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Specialist Capabilities Specialist Capabilities Specialist Capabilities Ticket Queue Ticket Queue Ticket Queue Ticket Queue Ticket Queue Ticket Queue
  113. 113. Operations as a Service: Turn handoffs into self-service Operations as a Service On Demand On Demand On Demand On Demand Ops (embedded) Cross-Functional Product Team 1 Cross-Functional Product Team n Ops (embedded) Ops (builds & operates) Cross-Functional Product Team 2 Ops (embedded) Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist
  114. 114. Development Team 1 Development Team 2 Development Team n Ops/SRE Team Operations as a Service On Demand On Demand On Demand On Demand Ops (builds & operates) Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist Operations as a Service: Works with any org model
  115. 115. Operations as a Service: Popular Uses for SRE Environment “I could fix it, if I could get to it”
  116. 116. Operations as a Service: Popular Uses for SRE Environment “I could fix it, if I could get to it” Environment OaaS
  117. 117. Operations as a Service: Popular Uses for SRE “Avoiding the dogpile” I think it’s a problem with dbcluster07-store2.uswest.acme dbcluster07-store2.uswest.acme “$ top” “$ top” “$ top” “$ top” “$ top” “$ top” “$ top”
  118. 118. Operations as a Service: Popular Uses for SRE “Avoiding the dogpile” I think it’s a problem with dbcluster07-store2.uswest.acme dbcluster07-store2.uswest.acme “$ top” “$ top” “$ top” “$ top” “$ top” “$ top” “$ top” I think it’s a problem with dbcluster07-store2.uswest.acme dbcluster07-store2.uswest.acme “$ top” “Healthcheck store2 - all” OaaS
  119. 119. “I don’t read wikis. I’m an expert.” docs Service has changed. This flag is now required or bad things will happen! Pause monitoring first or we all get woken up! “restart -doit -now” I’ve done this before. I’ve got this. Environment docs Later… Operations as a Service: Popular Uses for SRE
  120. 120. “I don’t read wikis. I’m an expert.” docs Service has changed. This flag is now required or bad things will happen! Pause monitoring first or we all get woken up! “restart -doit -now” I’ve done this before. I’ve got this. Environment docs Later… OaaS Service has changed. This flag is now required or bad things will happen! Pause monitoring first or we all get woken up! “restart” I’ve done this before. I’ve got this. Environment Later… Update Restart Job ✅ OaaS Operations as a Service: Popular Uses for SRE
  121. 121. Operations as a Service: Popular Uses for SRE “Uneven and hidden skills” I don’t know how to do X. I know how to do X. I know how to do Y. I don’t know how to do Y.
  122. 122. Operations as a Service: Popular Uses for SRE “Uneven and hidden skills” I don’t know how to do X. I know how to do X. I know how to do Y. I don’t know how to do Y. OaaS “Do X” “Define Y Procedure” “Define X Procedure” “Do Y” “Do X+Y”
  123. 123. “Let me do that for you again… and again” Done. I need you to do X Later… Ticket Other work Done. I need you to do X Later… Ticket Other work Sigh… Done. I need you to do X Ticket Other work Operations as a Service: Popular Uses for SRE
  124. 124. “Let me do that for you again… and again” Done. I need you to do X Later… Ticket Other work Done. I need you to do X Later… Ticket Other work Sigh… Done. I need you to do X Ticket Other work OaaS Do X Later… Other work 1 Later… Other work 2 Other work 3 Do X Do X OaaS OaaS Operations as a Service: Popular Uses for SRE
  125. 125. Use tickets only for what they are good for Ticket System
  126. 126. Use tickets only for what they are good for 1. Documenting true problems/issues/exceptions Ticket System
  127. 127. Use tickets only for what they are good for 1. Documenting true problems/issues/exceptions 2. Routing for necessary approvals Ticket System
  128. 128. Use tickets only for what they are good for 1. Documenting true problems/issues/exceptions 2. Routing for necessary approvals Not as a general purpose work management system! Ticket System
  129. 129. But won’t Security or Compliance stop you? Operations as a Service On Demand On Demand On Demand On Demand Ops (embedded) Cross-Functional Product Team 1 Cross-Functional Product Team n Ops (embedded) Ops (builds & operates) Cross-Functional Product Team 2 Ops (embedded) Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist Build-in Security Here Build-in Compliance Here
  130. 130. But what about ITIL® ?
  131. 131. But what about ITIL® ? • Ask ITIL people and they say SRE is ITIL compatible
  132. 132. But what about ITIL® ? • Ask ITIL people and they say SRE is ITIL compatible • Ask people who have seen ITIL implemented and they say “how?”
  133. 133. But what about ITIL® ? • Ask ITIL people and they say SRE is ITIL compatible • Ask people who have seen ITIL implemented and they say “how?” • Agile+DevOps+SRE have self-regulation and shared responsibility features that seem to undermine ITIL’s command-and-control nature
  134. 134. But what about ITIL® ? • Ask ITIL people and they say SRE is ITIL compatible • Ask people who have seen ITIL implemented and they say “how?” • Agile+DevOps+SRE have self-regulation and shared responsibility features that seem to undermine ITIL’s command-and-control nature • ITIL “Standard Change” is often the focus of discussion, but it still implies an approval model
  135. 135. But what about ITIL® ? • Ask ITIL people and they say SRE is ITIL compatible • Ask people who have seen ITIL implemented and they say “how?” • Agile+DevOps+SRE have self-regulation and shared responsibility features that seem to undermine ITIL’s command-and-control nature • ITIL “Standard Change” is often the focus of discussion, but it still implies an approval model • Straight talk: are we doing contortions to defend a sunk cost?
  136. 136. “Shift Left” the ability to take action escalate 1° 2° 3° 4° escalate escalate or
  137. 137. “Shift Left” the ability to take action Push the ability to take action this direction escalate 1° 2° 3° 4° escalate escalate or
  138. 138. “Shift Left” the ability to take action Push the ability to take action this direction escalate 1° 2° 3° 4° escalate escalate or OaaS Enablement and tooling
  139. 139. Reduce Toil
  140. 140. Reduce Toil 1. Track toil levels for each team
  141. 141. Reduce Toil 1. Track toil levels for each team 2. Set toil limits for each team
  142. 142. Reduce Toil 1. Track toil levels for each team 2. Set toil limits for each team 3. Fund efforts to reduce toil (with emphasis on teams over toil limits)
  143. 143. Start a book club
  144. 144. Recap SRE is more than a title Leverage the Operations as a Service design pattern “Shift-Left” control and decision making. Old Silo A Old Silo B Old Silo C Old Silo D Cross-Functional Team 1 Cross-Functional Team 2 Cross-Functional Team n Focus on removing silos and queues Operations as a Service On Demand On Demand On Demand On Demand Ops (embedded) Cross-Functional Product Team 1 Cross-Functional Product Team n Ops (embedded) Ops (builds & operates) Cross-Functional Product Team 2 Ops (embedded) Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist Ops Capability SRE, Dev, or Specialist Reduce toil to create capacity to change Toil Engineering Work E.W. Toil Reduce toil Improve the business No capacity to reduce toil No capacity to improve business Toil at manageable percentage of capacity Toil at unmanageable percentage of capacity (“Engineering Bankruptcy”) Understand the forces undermining SRE ITIL Book 1 ITIL Book 2 ITIL Book 3 ITIL Book 4 ITIL Book 5 Quality! is job #1 Sys Admin CAB CALENDAR Your new title is SRE. Now write code and be better at ops. PROVISIONING PROCESS Dilbert characters © Scott Adams www.dilbert.com
  145. 145. Let’s talk… @damonedwards damon@rundeck.com https://www.rundeck.com/oaas Dive Deeper Into Operations as a Service:
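
Slides 17 and 18 describe the Service Level Objective, the Service Level Indicator, and the error budget between them as tools for shared responsibility. The arithmetic behind that picture can be sketched as below. This is a minimal illustration, not something taken from the deck: the class name `ErrorBudget`, the request-count inputs, and the `release_policy` consequence rule are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget for one service over one measurement window (hypothetical model)."""

    slo_percent: float   # Service Level Objective, e.g. 99.9
    total_events: int    # all requests observed in the window
    failed_events: int   # requests that did not meet the objective

    @property
    def sli_percent(self) -> float:
        """Service Level Indicator: share of good events, as a percentage."""
        if self.total_events == 0:
            return 100.0  # no traffic means nothing has failed
        good = self.total_events - self.failed_events
        return 100.0 * good / self.total_events

    @property
    def allowed_failures(self) -> float:
        """Size of the error budget, expressed as a number of failed events."""
        return self.total_events * (100.0 - self.slo_percent) / 100.0

    @property
    def exhausted(self) -> bool:
        """True once the measured SLI has dropped below the SLO."""
        return self.sli_percent < self.slo_percent


def release_policy(budget: ErrorBudget) -> str:
    """Assumed consequence rule: spend budget on change, or stop and fix."""
    if budget.exhausted:
        return "prioritize reliability work"
    return "ship features"
```

Under these assumptions, a service with a 99.9% SLO and one million requests has room for roughly 1,000 failures in the window. Dev, Biz, and Ops all read the same number, which is what makes the consequence a shared one.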

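The three "Reduce Toil" steps on slides 140 to 142 (track toil per team, set a limit per team, fund reduction work for teams over the limit) can be sketched as a small report. The names `TeamPeriod` and `teams_over_limit` are hypothetical. The 50% default limit is an assumption borrowed from commonly cited SRE guidance, not a figure from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamPeriod:
    """Hours logged by one team in one tracking period (hypothetical record)."""

    team: str
    toil_hours: float
    total_hours: float


def toil_fraction(period: TeamPeriod) -> float:
    """Step 1: track the toil level as a share of the team's capacity."""
    if period.total_hours <= 0:
        return 0.0
    return period.toil_hours / period.total_hours


def teams_over_limit(periods: list[TeamPeriod], limit: float = 0.5) -> list[str]:
    """Steps 2 and 3: apply the limit and rank the teams that exceed it.

    Teams are returned worst-first, so funding for toil reduction can go
    to the most overloaded teams before the others.
    """
    over = [p for p in periods if toil_fraction(p) > limit]
    over.sort(key=toil_fraction, reverse=True)
    return [p.team for p in over]
```

A team logging 30 toil hours out of 40 would rank ahead of one logging 22 out of 40, and a team at 10 out of 40 would not appear at all.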
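Slides 119 and 120 show an expert running a remembered restart command after the service has changed, where a self-service "Restart Job" would have carried the updated procedure. One way to picture that pattern is sketched below. Everything here is hypothetical: the `Job` class, the step functions, and the flag name are placeholders, and no real tool's API or command is implied.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Job:
    """A named, self-service procedure whose steps are maintained by its owners."""

    name: str
    steps: List[Callable[[], str]]
    history: List[str] = field(default_factory=list)

    def run(self, requested_by: str) -> List[str]:
        """Run every step in order and record who asked for it."""
        self.history.append(f"{requested_by} ran {self.name}")
        return [step() for step in self.steps]


# Placeholder steps. A real job would call monitoring and deploy tooling here.
def pause_monitoring() -> str:
    return "monitoring paused"


def restart_with_required_flag() -> str:
    # "--new-required-flag" stands in for the flag the slide says is now required.
    return "service restarted with --new-required-flag"


def resume_monitoring() -> str:
    return "monitoring resumed"


# The owning team updates this definition when the service changes,
# so callers keep asking for "restart" and get the current procedure.
restart_job = Job(
    name="restart",
    steps=[pause_monitoring, restart_with_required_flag, resume_monitoring],
)
```

The design choice is that the caller only needs the job name. Knowledge such as "pause monitoring first" lives in the job definition instead of in a wiki page that experts skip, and the history list gives a simple record of who ran what.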