Helping operations top-heavy
teams the smart way
(Lessons from my experience being loaned out to SRE teams)
Michael Kehoe
Staff Site Reliability Engineer
Michael Kehoe
$ WHOAMI
• Staff Site Reliability Engineer @ LinkedIn
• Production-SRE Team
• Funny accent = Australian + 4 years
American
• Former Network Engineer at the
University of Queensland
Production-SRE Team @ LinkedIn
$ WHOAMI
• Disaster Recovery - Planning &
Automation
• Incident Response – Process &
Automation
• Visibility Engineering – Making use of
operational data
• Reliability Principles – Defining best
practice & automating it
• How to quickly erase all your
technical debt
• How to change your engineering
culture
This talk is not
• How to identify team anti-patterns
• How to work through high-toil
• How to create sustainable
workloads
This talk is
Today’s
agenda
1 Background
2 Scenario 1: Resource Allocation
3 Scenario 2: Technical Debt
4 Scenario 3: High Toil
5 Building A Formula For Success
6 Key Learnings
7 Q&A
Background
Personal Experience in the past 15 months
ASSISTANCE RENDERED
• Traffic-SRE: Resource Allocation
• Voyager-SRE: Technical Debt
• Capacity War-room
• Espresso-SRE: Reliability
Scenario 1: Resource
Allocation
Problem Statement
Resource Allocations
• Lack of written documentation
• Backlog of work for clients
• Alert Fatigue
Scenario 2: Technical Debt
Problem Statement
Technical Debt
• New frontend service
• Understanding performance is
complicated
• Management of dependent
services was difficult
Scenario 3: High toil
Problem Statement
High Toil
• Large multi-tenant/ multi-cluster
database team
• Lack of maturity in team-specific
automation
• Alert Fatigue
Building a formula for
success
Code Yellow
Building a formula for success
Define the areas
that need attacking
Problem Statement
Communicate
expectations with
clients & partners
Commutation &
Partnerships
Define success
criteria
Exit Criteria
Get the help that
you require
Resource
Acquisition
Plan for short-term
& long-term
Planning
Define the areas that need attacking
Problem Statement
• Admit there is a problem
• Measure the problem
• Understand the problem
• Determines underlying causes that
need to be fixed
Building a formula for success
Define success criteria
Exit Criteria
• Define concrete goals
• Define concrete success criteria
• Measure via an operational metric
• Measure via a project being
completed
• Define timelines for completion
Building a formula for success
Get the help you require
Resource Acquisition
• Ask other teams for help
• Get dedicated engineers/ project
managers/ other roles as required
• Set exit-date for resources
Building a formula for success
Plan for the short-term & long-term
Planning
• Plan out short-term work
• Plan out longer-term projects
• Do they need to be rescheduled?
• Prioritize work that will reduce toil &
burnout (Automation +
Measurement)
Building a formula for success
Communicate expectations with
clients & partners
Communicatio
n &
Partnerships
• Communicate problem statement &
exit criteria
• Send regular progress updates
• Ensure that stakeholders
understand delays & expected
outcomes
Building a formula for success
When Operations Isn’t Perfect
Code Yellow
https://devops.com/code-yellow-when-operations-isnt-perfect/
Key Learnings
Key Learnings
Measure toil/
overhead
Measure
Prioritize efforts to
remove overhead/toil
Prioritize
Communicate with
partners & teams
Communicate
Q&A
Helping operations top-heavy teams the smart way

Helping operations top-heavy teams the smart way

  • 1.
    Helping operations top-heavy teamsthe smart way (Lessons from my experience being loaned out to SRE teams) Michael Kehoe Staff Site Reliability Engineer
  • 2.
    Michael Kehoe $ WHOAMI •Staff Site Reliability Engineer @ LinkedIn • Production-SRE Team • Funny accent = Australian + 4 years American • Former Network Engineer at the University of Queensland
  • 3.
    Production-SRE Team @LinkedIn $ WHOAMI • Disaster Recovery - Planning & Automation • Incident Response – Process & Automation • Visibility Engineering – Making use of operational data • Reliability Principles – Defining best practice & automating it
  • 4.
    • How toquickly erase all your technical debt • How to change your engineering culture This talk is not
  • 5.
    • How toidentify team anti-patterns • How to work through high-toil • How to create sustainable workloads This talk is
  • 6.
    Today’s agenda 1 Background 2 Scenario1: Resource Allocation 3 Scenario 2: Technical Debt 4 Scenario 3: High Toil 5 Building A Formula For Success 6 Key Learnings 7 Q&A
  • 7.
  • 8.
    Personal Experience inthe past 15 months ASSISTANCE RENDERED • Traffic-SRE: Resource Allocation • Voyager-SRE: Technical Debt • Capacity War-room • Espresso-SRE: Reliability
  • 9.
  • 10.
    Problem Statement Resource Allocations •Lack of written documentation • Backlog of work for clients • Alert Fatigue
  • 11.
  • 12.
    Problem Statement Technical Debt •New frontend service • Understanding performance is complicated • Management of dependent services was difficult
  • 13.
  • 14.
    Problem Statement High Toil •Large multi-tenant/ multi-cluster database team • Lack of maturity in team-specific automation • Alert Fatigue
  • 15.
    Building a formulafor success
  • 16.
  • 17.
    Building a formulafor success Define the areas that need attacking Problem Statement Communicate expectations with clients & partners Commutation & Partnerships Define success criteria Exit Criteria Get the help that you require Resource Acquisition Plan for short-term & long-term Planning
  • 18.
    Define the areasthat need attacking Problem Statement • Admit there is a problem • Measure the problem • Understand the problem • Determines underlying causes that need to be fixed Building a formula for success
  • 19.
    Define success criteria ExitCriteria • Define concrete goals • Define concrete success criteria • Measure via an operational metric • Measure via a project being completed • Define timelines for completion Building a formula for success
  • 20.
    Get the helpyou require Resource Acquisition • Ask other teams for help • Get dedicated engineers/ project managers/ other roles as required • Set exit-date for resources Building a formula for success
  • 21.
    Plan for theshort-term & long-term Planning • Plan out short-term work • Plan out longer-term projects • Do they need to be rescheduled? • Prioritize work that will reduce toil & burnout (Automation + Measurement) Building a formula for success
  • 22.
    Communicate expectations with clients& partners Communicatio n & Partnerships • Communicate problem statement & exit criteria • Send regular progress updates • Ensure that stakeholders understand delays & expected outcomes Building a formula for success
  • 23.
    When Operations Isn’tPerfect Code Yellow https://devops.com/code-yellow-when-operations-isnt-perfect/
  • 24.
  • 25.
    Key Learnings Measure toil/ overhead Measure Prioritizeefforts to remove overhead/toil Prioritize Communicate with partners & teams Communicate
  • 26.

Editor's Notes

  • #4 Michael So we’re apart of a team at LinkedIn called Production-SRE The key tenants of production-sre at LinkedIn is: Assist in restoring stability during site-critical issues Developing applications to reduce MTTD and MTTR Provide direction and guidelines for site-troubleshooting Build tools for efficient site-issue troubleshooting, issue detection and correlation As this presentation goes on, you’ll notice how an Event Correlation system fits into these
  • #5 This talk isn’t how to magically erase all of your technical debt Neither is it a talk on changing your engineering culture
  • #6 This talk is How to identify team anti-patterns How to work through high-toil How to create sustainable workloads
  • #9 Michael So we’re apart of a team at LinkedIn called Production-SRE The key tenants of production-sre at LinkedIn is: Assist in restoring stability during site-critical issues Developing applications to reduce MTTD and MTTR Provide direction and guidelines for site-troubleshooting Build tools for efficient site-issue troubleshooting, issue detection and correlation As this presentation goes on, you’ll notice how an Event Correlation system fits into these
  • #11 So the first scenario I want to discuss is when I got pulled into the Traffic team due to severe resource allocation issues: We had a team that had a lack of written documentation on how their platform worked and was deployed They had a large backlog of work for clients And there was a large amount of alert fatigue due to a some poorly defined alerts and some infrastructure that needed upgrading but they hadn’t gotten to it yet Ontop of that, 4/5 team members left in a short period of time and started doing reliability operations at another company together So we’re in a bit of a pickle here…. So in response, we took 5 staff SRE’s from other teams and dedicated them to the traffic team for a period of 3 months Stopped all non-critical client work for a number of weeks Completely recreated all monitoring systems Spent a large chunk of time removing complexity Focused on infrastructure reliability
  • #13 The second team I worked with was our frontend API service team
  • #15 Thousands of instances Lack of maturity in automation for the team Alert fatigue given the size of their infrastructure Poor visibility into ops metrics