Helping operations top-heavy teams the smart way

SRE teams can sometimes run into periods where they suffer from staff burnout, technical debt, or poor reliability. As SREs, we’re programmed to keep fighting through the issues, when sometimes it’s best to step back, assess the situation, and ask for help to put the team back on a successful path. This talk will discuss three separate experiences where teams needed some extra help to stabilize their services and on-call. We’ll discuss how to identify struggling teams, get the right assistance, and build a strategy for the team to succeed.

Published in: Technology

  1. Helping operations top-heavy teams the smart way (Lessons from my experience being loaned out to SRE teams). Michael Kehoe, Staff Site Reliability Engineer
  2. $ WHOAMI: Michael Kehoe • Staff Site Reliability Engineer @ LinkedIn • Production-SRE Team • Funny accent = Australian + 4 years American • Former Network Engineer at the University of Queensland
  3. $ WHOAMI: Production-SRE Team @ LinkedIn • Disaster Recovery – Planning & Automation • Incident Response – Process & Automation • Visibility Engineering – Making use of operational data • Reliability Principles – Defining best practice & automating it
  4. This talk is not: • How to quickly erase all your technical debt • How to change your engineering culture
  5. This talk is: • How to identify team anti-patterns • How to work through high toil • How to create sustainable workloads
  6. Today’s agenda: (1) Background (2) Scenario 1: Resource Allocation (3) Scenario 2: Technical Debt (4) Scenario 3: High Toil (5) Building a Formula for Success (6) Key Learnings (7) Q&A
  7. Background
  8. Personal experience in the past 15 months. Assistance rendered: • Traffic-SRE: Resource Allocation • Voyager-SRE: Technical Debt • Capacity War-room • Espresso-SRE: Reliability
  9. Scenario 1: Resource Allocation
  10. Problem Statement: Resource Allocation • Lack of written documentation • Backlog of work for clients • Alert fatigue
  11. Scenario 2: Technical Debt
  12. Problem Statement: Technical Debt • New frontend service • Understanding performance is complicated • Management of dependent services was difficult
  13. Scenario 3: High Toil
  14. Problem Statement: High Toil • Large multi-tenant/multi-cluster database team • Lack of maturity in team-specific automation • Alert fatigue
  15. Building a formula for success
  16. Code Yellow
  17. Building a formula for success • Problem Statement: Define the areas that need attacking • Communication & Partnerships: Communicate expectations with clients & partners • Exit Criteria: Define success criteria • Resource Acquisition: Get the help that you require • Planning: Plan for short-term & long-term
  18. Building a formula for success. Problem Statement: Define the areas that need attacking • Admit there is a problem • Measure the problem • Understand the problem • Determine underlying causes that need to be fixed
  19. Building a formula for success. Exit Criteria: Define success criteria • Define concrete goals • Define concrete success criteria • Measure via an operational metric • Measure via a project being completed • Define timelines for completion
  20. Building a formula for success. Resource Acquisition: Get the help you require • Ask other teams for help • Get dedicated engineers/project managers/other roles as required • Set an exit date for resources
  21. Building a formula for success. Planning: Plan for the short-term & long-term • Plan out short-term work • Plan out longer-term projects • Do they need to be rescheduled? • Prioritize work that will reduce toil & burnout (Automation + Measurement)
  22. Building a formula for success. Communication & Partnerships: Communicate expectations with clients & partners • Communicate problem statement & exit criteria • Send regular progress updates • Ensure that stakeholders understand delays & expected outcomes
  23. Code Yellow: When Operations Isn’t Perfect https://devops.com/code-yellow-when-operations-isnt-perfect/
  24. Key Learnings
  25. Key Learnings • Measure: Measure toil/overhead • Prioritize: Prioritize efforts to remove overhead/toil • Communicate: Communicate with partners & teams
  26. Q&A
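
The slides advise measuring toil and defining exit criteria "via an operational metric" but do not show how. The following is only a minimal sketch of one way that could look, not the speaker's method. The `WeekLog` record, both function names, the sample hours, and the 0.5 target are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class WeekLog:
    """Hours one team logged in a week (hypothetical record shape)."""
    toil_hours: float     # manual, repetitive operational work
    project_hours: float  # engineering and automation work


def toil_fraction(log: WeekLog) -> float:
    """Share of logged time spent on toil; 0.0 if nothing was logged."""
    total = log.toil_hours + log.project_hours
    return log.toil_hours / total if total else 0.0


def exit_criteria_met(logs: list[WeekLog], target: float, weeks: int) -> bool:
    """True when each of the last `weeks` logs is at or below `target`."""
    recent = logs[-weeks:]
    return len(recent) == weeks and all(
        toil_fraction(week) <= target for week in recent
    )


# Made-up sample data: toil shrinking over four weeks.
history = [WeekLog(30, 10), WeekLog(24, 16), WeekLog(18, 22), WeekLog(16, 24)]
print([round(toil_fraction(week), 2) for week in history])
print(exit_criteria_met(history, target=0.5, weeks=2))
```

Requiring several consecutive weeks under the target, rather than a single good week, is one way to tie the exit criterion to a sustained trend; the right window and threshold would depend on the team.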
