Helping operations top-heavy teams the smart way

Helping operations top-heavy
teams the smart way
(Lessons from my experience being loaned out to SRE teams)
Michael Kehoe
Staff Site Reliability Engineer

Michael Kehoe
$ WHOAMI
• Staff Site Reliability Engineer @ LinkedIn
• Production-SRE Team
• Funny accent = Australian + 4 years
American
• Former Network Engineer at the
University of Queensland

Production-SRE Team @ LinkedIn
$ WHOAMI
• Disaster Recovery - Planning &
Automation
• Incident Response – Process &
Automation
• Visibility Engineering – Making use of
operational data
• Reliability Principles – Defining best
practice & automating it

• How to quickly erase all your
technical debt
• How to change your engineering
culture
This talk is not

• How to identify team anti-patterns
• How to work through high-toil
• How to create sustainable
workloads
This talk is

Today’s
agenda
1 Background
2 Scenario 1: Resource Allocation
3 Scenario 2: Technical Debt
4 Scenario 3: High Toil
5 Building A Formula For Success
6 Key Learnings
7 Q&A

Personal Experience in the past 15 months
ASSISTANCE RENDERED
• Traffic-SRE: Resource Allocation
• Voyager-SRE: Technical Debt
• Capacity War-room
• Espresso-SRE: Reliability

Scenario 1: Resource
Allocation

Problem Statement
Resource Allocations
• Lack of written documentation
• Backlog of work for clients
• Alert Fatigue

Problem Statement
Technical Debt
• New frontend service
• Understanding performance is
complicated
• Management of dependent
services was difficult

Problem Statement
High Toil
• Large multi-tenant/ multi-cluster
database team
• Lack of maturity in team-specific
automation
• Alert Fatigue

Building a formula for
success

Building a formula for success
Define the areas
that need attacking
Problem Statement
Communicate
expectations with
clients & partners
Commutation &
Partnerships
Define success
criteria
Exit Criteria
Get the help that
you require
Resource
Acquisition
Plan for short-term
& long-term
Planning

Define the areas that need attacking
Problem Statement
• Admit there is a problem
• Measure the problem
• Understand the problem
• Determines underlying causes that
need to be fixed

Define success criteria
Exit Criteria
• Define concrete goals
• Define concrete success criteria
• Measure via an operational metric
• Measure via a project being
completed
• Define timelines for completion

Get the help you require
Resource Acquisition
• Ask other teams for help
• Get dedicated engineers/ project
managers/ other roles as required
• Set exit-date for resources

Plan for the short-term & long-term
Planning
• Plan out short-term work
• Plan out longer-term projects
• Do they need to be rescheduled?
• Prioritize work that will reduce toil &
burnout (Automation +
Measurement)

Communicate expectations with
clients & partners
Communicatio
n &
Partnerships
• Communicate problem statement &
exit criteria
• Send regular progress updates
• Ensure that stakeholders
understand delays & expected
outcomes

When Operations Isn’t Perfect
Code Yellow
https://devops.com/code-yellow-when-operations-isnt-perfect/

Key Learnings
Measure toil/
overhead
Measure
Prioritize efforts to
remove overhead/toil
Prioritize
Communicate with
partners & teams
Communicate

Helping operations top-heavy teams the smart way

Helping operations top-heavy teams the smart way

More Related Content

What's hot

Similar to Helping operations top-heavy teams the smart way

More from Michael Kehoe

Recently uploaded

Helping operations top-heavy teams the smart way

Editor's Notes