Helping operations top-heavy teams the smart way

SRE teams can run into periods of staff burnout, technical debt, or poor reliability. As SREs, we’re programmed to keep fighting through the issues, when sometimes it’s best to step back, assess the situation, and ask for help to put the team back on a successful path. This talk discusses three separate experiences where teams needed extra help to stabilize their services and on-call. We’ll discuss how to identify struggling teams, get the right assistance, and build a strategy for the team to succeed.

  1. Helping operations top-heavy teams the smart way (Lessons from my experience being loaned out to SRE teams) Michael Kehoe, Staff Site Reliability Engineer
  2. $ WHOAMI: Michael Kehoe • Staff Site Reliability Engineer @ LinkedIn • Production-SRE Team • Funny accent = Australian + 4 years American • Former Network Engineer at the University of Queensland
  3. $ WHOAMI: Production-SRE Team @ LinkedIn • Disaster Recovery – Planning & Automation • Incident Response – Process & Automation • Visibility Engineering – Making use of operational data • Reliability Principles – Defining best practice & automating it
  4. This talk is not • How to quickly erase all your technical debt • How to change your engineering culture
  5. This talk is • How to identify team anti-patterns • How to work through high-toil • How to create sustainable workloads
  6. Today’s agenda • 1 Background • 2 Scenario 1: Resource Allocation • 3 Scenario 2: Technical Debt • 4 Scenario 3: High Toil • 5 Building A Formula For Success • 6 Key Learnings • 7 Q&A
  7. Background
  8. Personal Experience in the past 15 months: ASSISTANCE RENDERED • Traffic-SRE: Resource Allocation • Voyager-SRE: Technical Debt • Capacity War-room • Espresso-SRE: Reliability
  9. Scenario 1: Resource Allocation
  10. Problem Statement: Resource Allocations • Lack of written documentation • Backlog of work for clients • Alert Fatigue
  11. Scenario 2: Technical Debt
  12. Problem Statement: Technical Debt • New frontend service • Understanding performance is complicated • Management of dependent services was difficult
  13. Scenario 3: High toil
  14. Problem Statement: High Toil • Large multi-tenant/multi-cluster database team • Lack of maturity in team-specific automation • Alert Fatigue
  15. Building a formula for success
  16. Code Yellow
  17. Building a formula for success • Problem Statement: Define the areas that need attacking • Communication & Partnerships: Communicate expectations with clients & partners • Exit Criteria: Define success criteria • Resource Acquisition: Get the help that you require • Planning: Plan for short-term & long-term
  18. Building a formula for success: Problem Statement (Define the areas that need attacking) • Admit there is a problem • Measure the problem • Understand the problem • Determine underlying causes that need to be fixed
  19. Building a formula for success: Exit Criteria (Define success criteria) • Define concrete goals • Define concrete success criteria • Measure via an operational metric • Measure via a project being completed • Define timelines for completion
  20. Building a formula for success: Resource Acquisition (Get the help you require) • Ask other teams for help • Get dedicated engineers/project managers/other roles as required • Set exit-date for resources
  21. Building a formula for success: Planning (Plan for the short-term & long-term) • Plan out short-term work • Plan out longer-term projects • Do they need to be rescheduled? • Prioritize work that will reduce toil & burnout (Automation + Measurement)
  22. Building a formula for success: Communication & Partnerships (Communicate expectations with clients & partners) • Communicate problem statement & exit criteria • Send regular progress updates • Ensure that stakeholders understand delays & expected outcomes
  23. Code Yellow: When Operations Isn’t Perfect https://devops.com/code-yellow-when-operations-isnt-perfect/
  24. Key Learnings
  25. Key Learnings • Measure: Measure toil/overhead • Prioritize: Prioritize efforts to remove overhead/toil • Communicate: Communicate with partners & teams
  26. Q&A
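
The slides advise measuring toil and defining exit criteria "via an operational metric" but do not show what that could look like. The following is a minimal sketch only, not the process used in the talk: the `WeekLog` record, the `toil_fraction` and `exit_criteria_met` helpers, and every threshold value are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class WeekLog:
    """One week of (hypothetical) team data."""
    toil_hours: float   # manual, repetitive operational work
    total_hours: float  # all engineering hours logged that week
    pages: int          # alerts that paged the on-call


def toil_fraction(week: WeekLog) -> float:
    """Share of the week's hours that went to toil."""
    if week.total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return week.toil_hours / week.total_hours


def exit_criteria_met(weeks, max_toil=0.5, max_pages=20, sustained_weeks=4):
    """True when the most recent `sustained_weeks` weeks all stay under
    both thresholds. The default thresholds are illustrative assumptions."""
    if len(weeks) < sustained_weeks:
        return False
    recent = weeks[-sustained_weeks:]
    return all(
        toil_fraction(w) <= max_toil and w.pages <= max_pages
        for w in recent
    )


if __name__ == "__main__":
    history = [
        WeekLog(toil_hours=30, total_hours=40, pages=50),
        WeekLog(toil_hours=18, total_hours=40, pages=15),
        WeekLog(toil_hours=16, total_hours=40, pages=12),
        WeekLog(toil_hours=12, total_hours=40, pages=9),
        WeekLog(toil_hours=10, total_hours=40, pages=8),
    ]
    print(exit_criteria_met(history))
```

Requiring the thresholds to hold for several consecutive weeks, rather than a single good week, is one way to tie the exit date for borrowed engineers to a sustained improvement.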

Editor's Notes

  • Michael
    So we’re a part of a team at LinkedIn called Production-SRE
    The key tenets of Production-SRE at LinkedIn are:
    Assist in restoring stability during site-critical issues
    Develop applications to reduce MTTD and MTTR
    Provide direction and guidelines for site-troubleshooting
    Build tools for efficient site-issue troubleshooting, issue detection and correlation

    As this presentation goes on, you’ll notice how an Event Correlation system fits into these
  • This talk isn’t about how to magically erase all of your technical debt
    Neither is it a talk on changing your engineering culture
  • This talk is
    How to identify team anti-patterns
    How to work through high-toil
    How to create sustainable workloads
  • So the first scenario I want to discuss is when I got pulled into the Traffic team due to severe resource allocation issues:
    We had a team that had a lack of written documentation on how their platform worked and was deployed
    They had a large backlog of work for clients
    And there was a large amount of alert fatigue due to some poorly defined alerts and some infrastructure that needed upgrading that they hadn’t gotten to yet
    On top of that, 4/5 team members left in a short period of time and started doing reliability operations at another company together

    So we’re in a bit of a pickle here...
    So in response, we took 5 staff SREs from other teams and dedicated them to the Traffic team for a period of 3 months
    Stopped all non-critical client work for a number of weeks
    Completely recreated all monitoring systems
    Spent a large chunk of time removing complexity
    Focused on infrastructure reliability
  • The second team I worked with was our frontend API service team
  • Thousands of instances
    Lack of maturity in automation for the team
    Alert fatigue given the size of their infrastructure
    Poor visibility into ops metrics
