Helping operations top-heavy teams the smart way

Michael Kehoe
Michael KehoeArchitect of reliable, scalable infrastructure at LinkedIn
Helping operations top-heavy teams
the smart way
Jeff Weiner
Chief Executive Officer
Michael Kehoe
Staff Site Reliability Engineer
Todd Palino
Sr Staff Site Reliability Engineer
This Is The Only Slide You May Need a Picture Of
slideshare.net/ToddPalino slideshare.net/MichaelKehoe3
Michael Kehoe
$ WHOAMI
• Staff Site Reliability Engineer @ LinkedIn
• Production-SRE Team
• Funny accent = Australian + 4 years
American
• Former Network Engineer at the University
of Queensland
Todd Palino
$ WHOAMI
• Senior Staff SRE @ LinkedIn
• Capacity Engineering Team
• Co-Author of Kafka: The Definitive Guide
• Late of VeriSign Infrastructure Engineering
When Operations Isn’t Perfect
Code Yellow
https://devops.com/code-yellow-when-operations-isnt-perfect/
• How to quickly erase all your
technical debt
• How to change your engineering
culture
This talk is not
• How to identify team anti-patterns
• How to work through high toil
• How to create sustainable workloads
This talk is
Today’s
agenda
1 Background
2 Scenario 1: Traffic-SRE
3 Scenario 2: Kafka-SRE
4 Building A Formula For Success
5 Key Learnings
6 Q&A
Background
Personal Experience in the past 19 months
ASSISTANCE RENDERED
• Traffic-SRE: Technical Debt/ Resource
Allocation
• Voyager-SRE: Technical Debt
• Capacity War-room
• Espresso-SRE: Reliability
• Kafka-SRE: Capacity and Alert Fatigue
Scenario 1: Traffic-SRE
Problem Statement
Technical Debt
• Written documentation needed
improvement
• Deployment infrastructure needed
investment
• Alert Fatigue
Traffic-SRE
Problem Statement
Resource Allocations
• Backlog of work for clients
• Staff shortage
Scenario 2: Kafka
Helping operations top-heavy teams the smart way
Problem Statement
Capacity Planning
• Multi-tenant Infrastructure
• No resource controls
• Unclear resource ownership
• Ad-hoc capacity planning
• Sudden 100% increase in traffic
Problem Statement
Alert Fatigue
• Multiple applications overutilized
• No time for proactive work
• Most alerts non-actionable
Building a formula for
success
Code Yellow
Building a formula for success
Define the areas that
need attacking
Problem Statement
Communicate
expectations with
clients & partners
Communication &
Partnerships
Define success
criteria
Exit Criteria
Get the help that you
require
Resource Acquisition
Plan for short-term &
long-term
Planning
Define the areas that need attacking
Problem Statement
• Admit there is a problem
• Measure the problem
• Understand the problem
• Determines underlying causes that
need to be fixed
Building a formula for success
Define success criteria
Exit Criteria
• Define concrete goals
• Define concrete success criteria
• Measure via an operational metric
• Measure via a project being
completed
• Define timelines for completion
Building a formula for success
Get the help you require
Resource Acquisition
• Ask other teams for help
• Get dedicated engineers/ project
managers/ other roles as required
• Set exit-date for resources
Building a formula for success
Plan for the short-term & long-term
Planning
• Plan out short-term work
• Plan out longer-term projects
• Do they need to be rescheduled?
• Prioritize work that will reduce toil &
burnout (Automation + Measurement)
Building a formula for success
Communicate expectations with clients
& partners
Communication
& Partnerships
• Communicate problem statement &
exit criteria
• Send regular progress updates
• Ensure that stakeholders understand
delays & expected outcomes
Building a formula for success
Key Learnings
Key Learnings
Measure toil/ overhead
Measure
Prioritize efforts to
remove overhead/toil
Prioritize
Communicate with
partners & teams
Communicate
Q&A
Helping operations top-heavy teams the smart way
1 of 29

Recommended

Code Yellow: Helping operations top-heavy teams the smart way by
Code Yellow: Helping operations top-heavy teams the smart wayCode Yellow: Helping operations top-heavy teams the smart way
Code Yellow: Helping operations top-heavy teams the smart wayMichael Kehoe
140 views29 slides
Agile foundation and agile myths by
Agile foundation and agile mythsAgile foundation and agile myths
Agile foundation and agile mythsDennis Stevens
1.9K views24 slides
Lazar Milovic - No estimates by
Lazar Milovic - No estimatesLazar Milovic - No estimates
Lazar Milovic - No estimatesSerbian Product Community
88 views10 slides
User Story Cycle Time - An Universal Agile Maturity Measurement by
User Story Cycle Time - An Universal Agile Maturity MeasurementUser Story Cycle Time - An Universal Agile Maturity Measurement
User Story Cycle Time - An Universal Agile Maturity MeasurementEthan Huang
5.3K views22 slides
Identifying and measuring testing debt by
Identifying and measuring testing debtIdentifying and measuring testing debt
Identifying and measuring testing debtPeter Varhol
358 views28 slides
Fundamentals of Agile by
Fundamentals of AgileFundamentals of Agile
Fundamentals of AgileZülfikar Karakaya
618 views13 slides

More Related Content

What's hot

BoS2015 Trish Khoo – Engineering Manager, Google by
BoS2015 Trish Khoo – Engineering Manager, GoogleBoS2015 Trish Khoo – Engineering Manager, Google
BoS2015 Trish Khoo – Engineering Manager, GoogleBusiness of Software Conference
3.1K views29 slides
Agile strategy by
Agile strategyAgile strategy
Agile strategyJonathan Mor
195 views20 slides
DevOps By The Numbers by
DevOps By The NumbersDevOps By The Numbers
DevOps By The NumbersXebiaLabs
518 views20 slides
State of continuous delivery in 2015 - Minsk 15-5-2015 by
State of continuous delivery in 2015 - Minsk 15-5-2015State of continuous delivery in 2015 - Minsk 15-5-2015
State of continuous delivery in 2015 - Minsk 15-5-2015Pavel Chunyayev
534 views55 slides
Pm training day 4 by
Pm training   day 4Pm training   day 4
Pm training day 4Wasim Khalil PMP®,PRINCE2P®,ITIL®
427 views21 slides
Anton Muzhailo - Practical Test Process Improvement using ISTQB by
Anton Muzhailo - Practical Test Process Improvement using ISTQBAnton Muzhailo - Practical Test Process Improvement using ISTQB
Anton Muzhailo - Practical Test Process Improvement using ISTQBIevgenii Katsan
236 views29 slides

What's hot(20)

DevOps By The Numbers by XebiaLabs
DevOps By The NumbersDevOps By The Numbers
DevOps By The Numbers
XebiaLabs518 views
State of continuous delivery in 2015 - Minsk 15-5-2015 by Pavel Chunyayev
State of continuous delivery in 2015 - Minsk 15-5-2015State of continuous delivery in 2015 - Minsk 15-5-2015
State of continuous delivery in 2015 - Minsk 15-5-2015
Pavel Chunyayev534 views
Anton Muzhailo - Practical Test Process Improvement using ISTQB by Ievgenii Katsan
Anton Muzhailo - Practical Test Process Improvement using ISTQBAnton Muzhailo - Practical Test Process Improvement using ISTQB
Anton Muzhailo - Practical Test Process Improvement using ISTQB
Ievgenii Katsan236 views
Dare to Explore: Discover ET! by Raj Indugula
Dare to Explore: Discover ET!Dare to Explore: Discover ET!
Dare to Explore: Discover ET!
Raj Indugula3.8K views
Optimize Portfolio Performance with Simple Agile Techniques and Jira - Part 1... by Cprime
Optimize Portfolio Performance with Simple Agile Techniques and Jira - Part 1...Optimize Portfolio Performance with Simple Agile Techniques and Jira - Part 1...
Optimize Portfolio Performance with Simple Agile Techniques and Jira - Part 1...
Cprime251 views
LeanKit Webinar: Managing Complex Workflows by hscrume
LeanKit Webinar: Managing Complex WorkflowsLeanKit Webinar: Managing Complex Workflows
LeanKit Webinar: Managing Complex Workflows
hscrume197 views
Agile Fundamentals by Graham Dick
Agile FundamentalsAgile Fundamentals
Agile Fundamentals
Graham Dick1.2K views
Agile Project Development by Hajrah Jahan
Agile Project DevelopmentAgile Project Development
Agile Project Development
Hajrah Jahan236 views
Scaling on Atlassian: Avoiding The Top 5 Pitfalls When Migrating From a Legac... by Cprime
Scaling on Atlassian: Avoiding The Top 5 Pitfalls When Migrating From a Legac...Scaling on Atlassian: Avoiding The Top 5 Pitfalls When Migrating From a Legac...
Scaling on Atlassian: Avoiding The Top 5 Pitfalls When Migrating From a Legac...
Cprime698 views
Oana Feidi - Debugging - Root cause analysis - CodeCamp-10-may-2014 by Codecamp Romania
Oana Feidi - Debugging - Root cause analysis - CodeCamp-10-may-2014Oana Feidi - Debugging - Root cause analysis - CodeCamp-10-may-2014
Oana Feidi - Debugging - Root cause analysis - CodeCamp-10-may-2014
Codecamp Romania530 views
Agile and waterfall by John Morse
Agile and waterfallAgile and waterfall
Agile and waterfall
John Morse20.5K views
All You Want To About Kanban Before Doing Kanban Certification | AgileFever by AgileFever
All You Want To About Kanban Before Doing Kanban Certification | AgileFeverAll You Want To About Kanban Before Doing Kanban Certification | AgileFever
All You Want To About Kanban Before Doing Kanban Certification | AgileFever
AgileFever35 views
Understanding the Relationship between Lean, Agile, and DevOps: Jon's Slides by LeanKit
Understanding the Relationship between Lean, Agile, and DevOps: Jon's SlidesUnderstanding the Relationship between Lean, Agile, and DevOps: Jon's Slides
Understanding the Relationship between Lean, Agile, and DevOps: Jon's Slides
LeanKit1.1K views
Software Advice UserView: Agile Project Management Report 2015 by Software Advice
Software Advice UserView: Agile Project Management Report 2015Software Advice UserView: Agile Project Management Report 2015
Software Advice UserView: Agile Project Management Report 2015
Software Advice1.1K views

Similar to Helping operations top-heavy teams the smart way

Helping operations top-heavy teams the smart way by
Helping operations top-heavy teams the smart wayHelping operations top-heavy teams the smart way
Helping operations top-heavy teams the smart wayMichael Kehoe
233 views27 slides
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te... by
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...
BoS2015 Jeff Szczepanski – COO, Stack Exchange - Stack Overflow. Scaling a Te...Business of Software Conference
2.5K views50 slides
Applying both of waterfall and iterative development by
Applying both of waterfall and iterative developmentApplying both of waterfall and iterative development
Applying both of waterfall and iterative developmentDeny Prasetia
533 views14 slides
Success recipe for new IT projects-Agile way. Fail Fast, Fail Early by
Success recipe for new IT projects-Agile way. Fail Fast, Fail EarlySuccess recipe for new IT projects-Agile way. Fail Fast, Fail Early
Success recipe for new IT projects-Agile way. Fail Fast, Fail EarlyJoseph Vargheese PMP CSM CSP
1.1K views31 slides
The Dashlane Agile Journey by
The Dashlane Agile JourneyThe Dashlane Agile Journey
The Dashlane Agile JourneyDashlane
2.9K views36 slides
INAAU Project Management for Telecommunications Professionals by
INAAU Project Management for Telecommunications ProfessionalsINAAU Project Management for Telecommunications Professionals
INAAU Project Management for Telecommunications ProfessionalsRory McKenna
1.7K views30 slides

Similar to Helping operations top-heavy teams the smart way(20)

Helping operations top-heavy teams the smart way by Michael Kehoe
Helping operations top-heavy teams the smart wayHelping operations top-heavy teams the smart way
Helping operations top-heavy teams the smart way
Michael Kehoe233 views
Applying both of waterfall and iterative development by Deny Prasetia
Applying both of waterfall and iterative developmentApplying both of waterfall and iterative development
Applying both of waterfall and iterative development
Deny Prasetia533 views
The Dashlane Agile Journey by Dashlane
The Dashlane Agile JourneyThe Dashlane Agile Journey
The Dashlane Agile Journey
Dashlane2.9K views
INAAU Project Management for Telecommunications Professionals by Rory McKenna
INAAU Project Management for Telecommunications ProfessionalsINAAU Project Management for Telecommunications Professionals
INAAU Project Management for Telecommunications Professionals
Rory McKenna1.7K views
Doing It On Your Own: When to Call in the Consultants, When to Leave Them Out by NTEN
Doing It On Your Own: When to Call in the Consultants, When to Leave Them OutDoing It On Your Own: When to Call in the Consultants, When to Leave Them Out
Doing It On Your Own: When to Call in the Consultants, When to Leave Them Out
NTEN325 views
American Electric Power Ercot kickoff by John Napier
American Electric Power Ercot kickoffAmerican Electric Power Ercot kickoff
American Electric Power Ercot kickoff
John Napier676 views
AVATA Webinar: Solutions to Common Demantra & ASCP Challenges by AVATA
AVATA Webinar: Solutions to Common Demantra & ASCP ChallengesAVATA Webinar: Solutions to Common Demantra & ASCP Challenges
AVATA Webinar: Solutions to Common Demantra & ASCP Challenges
AVATA808 views
Kristian Fischer - Put Test in the Driver's Seat by TEST Huddle
Kristian Fischer - Put Test in the Driver's SeatKristian Fischer - Put Test in the Driver's Seat
Kristian Fischer - Put Test in the Driver's Seat
TEST Huddle788 views
Process improvement scrum_agile_v2_by_david_mann by Jim Sutter
Process improvement scrum_agile_v2_by_david_mannProcess improvement scrum_agile_v2_by_david_mann
Process improvement scrum_agile_v2_by_david_mann
Jim Sutter558 views
Scrum Agile by David Mann by James Sutter
 Scrum Agile by David Mann Scrum Agile by David Mann
Scrum Agile by David Mann
James Sutter1.5K views
Engineering Teams and Systems for Velocity by Jean Barmash
Engineering Teams and Systems for VelocityEngineering Teams and Systems for Velocity
Engineering Teams and Systems for Velocity
Jean Barmash586 views
Fundamentals of agile tntu (2015-04-27) by Oleg Nazarevych
Fundamentals of agile   tntu (2015-04-27)Fundamentals of agile   tntu (2015-04-27)
Fundamentals of agile tntu (2015-04-27)
Oleg Nazarevych414 views
Tackling the Fallacy of Agile by BSGAfrica
Tackling the Fallacy of Agile Tackling the Fallacy of Agile
Tackling the Fallacy of Agile
BSGAfrica669 views
CRMready Webinar Series - Part 3 - How to Make Your Nonprofit’s CRM Implement... by TheConnectedCause
CRMready Webinar Series - Part 3 - How to Make Your Nonprofit’s CRM Implement...CRMready Webinar Series - Part 3 - How to Make Your Nonprofit’s CRM Implement...
CRMready Webinar Series - Part 3 - How to Make Your Nonprofit’s CRM Implement...
TheConnectedCause1.8K views

More from Michael Kehoe

eBPF Workshop by
eBPF WorkshopeBPF Workshop
eBPF WorkshopMichael Kehoe
1.4K views26 slides
eBPF Basics by
eBPF BasicseBPF Basics
eBPF BasicsMichael Kehoe
2.7K views63 slides
QConSF 2018: Building Production-Ready Applications by
QConSF 2018: Building Production-Ready ApplicationsQConSF 2018: Building Production-Ready Applications
QConSF 2018: Building Production-Ready ApplicationsMichael Kehoe
193 views43 slides
AllDayDevops: What the NTSB teaches us about incident management & postmortems by
AllDayDevops: What the NTSB teaches us about incident management & postmortemsAllDayDevops: What the NTSB teaches us about incident management & postmortems
AllDayDevops: What the NTSB teaches us about incident management & postmortemsMichael Kehoe
321 views58 slides
Linux Container Basics by
Linux Container BasicsLinux Container Basics
Linux Container BasicsMichael Kehoe
510 views29 slides
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops by
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet DropsPapers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet DropsMichael Kehoe
285 views43 slides

More from Michael Kehoe(20)

QConSF 2018: Building Production-Ready Applications by Michael Kehoe
QConSF 2018: Building Production-Ready ApplicationsQConSF 2018: Building Production-Ready Applications
QConSF 2018: Building Production-Ready Applications
Michael Kehoe193 views
AllDayDevops: What the NTSB teaches us about incident management & postmortems by Michael Kehoe
AllDayDevops: What the NTSB teaches us about incident management & postmortemsAllDayDevops: What the NTSB teaches us about incident management & postmortems
AllDayDevops: What the NTSB teaches us about incident management & postmortems
Michael Kehoe321 views
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops by Michael Kehoe
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet DropsPapers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops
Michael Kehoe285 views
What the NTSB teaches us about incident management & postmortems by Michael Kehoe
What the NTSB teaches us about incident management & postmortemsWhat the NTSB teaches us about incident management & postmortems
What the NTSB teaches us about incident management & postmortems
Michael Kehoe489 views
PyBay 2018: Production-Ready Python Applications by Michael Kehoe
PyBay 2018: Production-Ready Python ApplicationsPyBay 2018: Production-Ready Python Applications
PyBay 2018: Production-Ready Python Applications
Michael Kehoe283 views
The Next Wave of Reliability Engineering by Michael Kehoe
The Next Wave of Reliability EngineeringThe Next Wave of Reliability Engineering
The Next Wave of Reliability Engineering
Michael Kehoe687 views
Building Production-Ready Microservices: DevopsExchangeSF by Michael Kehoe
Building Production-Ready Microservices: DevopsExchangeSFBuilding Production-Ready Microservices: DevopsExchangeSF
Building Production-Ready Microservices: DevopsExchangeSF
Michael Kehoe452 views
SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine... by Michael Kehoe
SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine...SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine...
SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engine...
Michael Kehoe321 views
SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at... by Michael Kehoe
SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at...SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at...
SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at...
Michael Kehoe270 views
SRECon-Europe-2017: Networks for SREs by Michael Kehoe
SRECon-Europe-2017: Networks for SREsSRECon-Europe-2017: Networks for SREs
SRECon-Europe-2017: Networks for SREs
Michael Kehoe383 views
Velocity San Jose 2017: Traffic shifts: Avoiding disasters at scale by Michael Kehoe
Velocity San Jose 2017: Traffic shifts: Avoiding disasters at scaleVelocity San Jose 2017: Traffic shifts: Avoiding disasters at scale
Velocity San Jose 2017: Traffic shifts: Avoiding disasters at scale
Michael Kehoe247 views
Reducing MTTR and False Escalations: Event Correlation at LinkedIn by Michael Kehoe
Reducing MTTR and False Escalations: Event Correlation at LinkedInReducing MTTR and False Escalations: Event Correlation at LinkedIn
Reducing MTTR and False Escalations: Event Correlation at LinkedIn
Michael Kehoe955 views
APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at ... by Michael Kehoe
APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at ...APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at ...
APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at ...
Michael Kehoe534 views
Couchbase Connect 2016: Monitoring Production Deployments The Tools – LinkedIn by Michael Kehoe
Couchbase Connect 2016: Monitoring Production Deployments The Tools – LinkedInCouchbase Connect 2016: Monitoring Production Deployments The Tools – LinkedIn
Couchbase Connect 2016: Monitoring Production Deployments The Tools – LinkedIn
Michael Kehoe720 views
Using SaltStack to Auto Triage and Remediate Production Systems by Michael Kehoe
Using SaltStack to Auto Triage and Remediate Production SystemsUsing SaltStack to Auto Triage and Remediate Production Systems
Using SaltStack to Auto Triage and Remediate Production Systems
Michael Kehoe1.8K views
SRECon USA 2016: Growing your Entry Level Talent by Michael Kehoe
SRECon USA 2016: Growing your Entry Level TalentSRECon USA 2016: Growing your Entry Level Talent
SRECon USA 2016: Growing your Entry Level Talent
Michael Kehoe520 views

Recently uploaded

Update 42 models(Diode/General ) in SPICE PARK(DEC2023) by
Update 42 models(Diode/General ) in SPICE PARK(DEC2023)Update 42 models(Diode/General ) in SPICE PARK(DEC2023)
Update 42 models(Diode/General ) in SPICE PARK(DEC2023)Tsuyoshi Horigome
18 views16 slides
CHEMICAL KINETICS.pdf by
CHEMICAL KINETICS.pdfCHEMICAL KINETICS.pdf
CHEMICAL KINETICS.pdfAguedaGutirrez
7 views337 slides
IWISS Catalog 2022 by
IWISS Catalog 2022IWISS Catalog 2022
IWISS Catalog 2022Iwiss Tools Co.,Ltd
24 views66 slides
LFA-NPG-Paper.pdf by
LFA-NPG-Paper.pdfLFA-NPG-Paper.pdf
LFA-NPG-Paper.pdfharinsrikanth
40 views13 slides
DevOps to DevSecOps: Enhancing Software Security Throughout The Development L... by
DevOps to DevSecOps: Enhancing Software Security Throughout The Development L...DevOps to DevSecOps: Enhancing Software Security Throughout The Development L...
DevOps to DevSecOps: Enhancing Software Security Throughout The Development L...Anowar Hossain
10 views34 slides
SEMI CONDUCTORS by
SEMI CONDUCTORSSEMI CONDUCTORS
SEMI CONDUCTORSpavaniaalla2005
19 views8 slides

Recently uploaded(20)

Update 42 models(Diode/General ) in SPICE PARK(DEC2023) by Tsuyoshi Horigome
Update 42 models(Diode/General ) in SPICE PARK(DEC2023)Update 42 models(Diode/General ) in SPICE PARK(DEC2023)
Update 42 models(Diode/General ) in SPICE PARK(DEC2023)
DevOps to DevSecOps: Enhancing Software Security Throughout The Development L... by Anowar Hossain
DevOps to DevSecOps: Enhancing Software Security Throughout The Development L...DevOps to DevSecOps: Enhancing Software Security Throughout The Development L...
DevOps to DevSecOps: Enhancing Software Security Throughout The Development L...
Anowar Hossain10 views
MSA Website Slideshow (16).pdf by msaucla
MSA Website Slideshow (16).pdfMSA Website Slideshow (16).pdf
MSA Website Slideshow (16).pdf
msaucla39 views
7_DVD_Combinational_MOS_Logic_Circuits.pdf by Usha Mehta
7_DVD_Combinational_MOS_Logic_Circuits.pdf7_DVD_Combinational_MOS_Logic_Circuits.pdf
7_DVD_Combinational_MOS_Logic_Circuits.pdf
Usha Mehta50 views
Literature review and Case study on Commercial Complex in Nepal, Durbar mall,... by AakashShakya12
Literature review and Case study on Commercial Complex in Nepal, Durbar mall,...Literature review and Case study on Commercial Complex in Nepal, Durbar mall,...
Literature review and Case study on Commercial Complex in Nepal, Durbar mall,...
AakashShakya1245 views
Multi-objective distributed generation integration in radial distribution sy... by IJECEIAES
Multi-objective distributed generation integration in radial  distribution sy...Multi-objective distributed generation integration in radial  distribution sy...
Multi-objective distributed generation integration in radial distribution sy...
IJECEIAES15 views
Performance of Back-to-Back Mechanically Stabilized Earth Walls Supporting th... by ahmedmesaiaoun
Performance of Back-to-Back Mechanically Stabilized Earth Walls Supporting th...Performance of Back-to-Back Mechanically Stabilized Earth Walls Supporting th...
Performance of Back-to-Back Mechanically Stabilized Earth Walls Supporting th...
ahmedmesaiaoun12 views
What is Whirling Hygrometer.pdf by IIT KHARAGPUR
What is Whirling Hygrometer.pdfWhat is Whirling Hygrometer.pdf
What is Whirling Hygrometer.pdf
IIT KHARAGPUR 11 views
13_DVD_Latch-up_prevention.pdf by Usha Mehta
13_DVD_Latch-up_prevention.pdf13_DVD_Latch-up_prevention.pdf
13_DVD_Latch-up_prevention.pdf
Usha Mehta9 views
2_DVD_ASIC_Design_FLow.pdf by Usha Mehta
2_DVD_ASIC_Design_FLow.pdf2_DVD_ASIC_Design_FLow.pdf
2_DVD_ASIC_Design_FLow.pdf
Usha Mehta14 views

Helping operations top-heavy teams the smart way

  • 1. Helping operations top-heavy teams the smart way Jeff Weiner Chief Executive Officer Michael Kehoe Staff Site Reliability Engineer Todd Palino Sr Staff Site Reliability Engineer
  • 2. This Is The Only Slide You May Need a Picture Of slideshare.net/ToddPalino slideshare.net/MichaelKehoe3
  • 3. Michael Kehoe $ WHOAMI • Staff Site Reliability Engineer @ LinkedIn • Production-SRE Team • Funny accent = Australian + 4 years American • Former Network Engineer at the University of Queensland
  • 4. Todd Palino $ WHOAMI • Senior Staff SRE @ LinkedIn • Capacity Engineering Team • Co-Author of Kafka: The Definitive Guide • Late of VeriSign Infrastructure Engineering
  • 5. When Operations Isn’t Perfect Code Yellow https://devops.com/code-yellow-when-operations-isnt-perfect/
  • 6. • How to quickly erase all your technical debt • How to change your engineering culture This talk is not
  • 7. • How to identify team anti-patterns • How to work through high toil • How to create sustainable workloads This talk is
  • 8. Today’s agenda 1 Background 2 Scenario 1: Traffic-SRE 3 Scenario 2: Kafka-SRE 4 Building A Formula For Success 5 Key Learnings 6 Q&A
  • 10. Personal Experience in the past 19 months ASSISTANCE RENDERED • Traffic-SRE: Technical Debt/ Resource Allocation • Voyager-SRE: Technical Debt • Capacity War-room • Espresso-SRE: Reliability • Kafka-SRE: Capacity and Alert Fatigue
  • 12. Problem Statement Technical Debt • Written documentation needed improvement • Deployment infrastructure needed investment • Alert Fatigue Traffic-SRE
  • 13. Problem Statement Resource Allocations • Backlog of work for clients • Staff shortage
  • 16. Problem Statement Capacity Planning • Multi-tenant Infrastructure • No resource controls • Unclear resource ownership • Ad-hoc capacity planning • Sudden 100% increase in traffic
  • 17. Problem Statement Alert Fatigue • Multiple applications overutilized • No time for proactive work • Most alerts non-actionable
  • 18. Building a formula for success
  • 20. Building a formula for success Define the areas that need attacking Problem Statement Communicate expectations with clients & partners Communication & Partnerships Define success criteria Exit Criteria Get the help that you require Resource Acquisition Plan for short-term & long-term Planning
  • 21. Define the areas that need attacking Problem Statement • Admit there is a problem • Measure the problem • Understand the problem • Determines underlying causes that need to be fixed Building a formula for success
  • 22. Define success criteria Exit Criteria • Define concrete goals • Define concrete success criteria • Measure via an operational metric • Measure via a project being completed • Define timelines for completion Building a formula for success
  • 23. Get the help you require Resource Acquisition • Ask other teams for help • Get dedicated engineers/ project managers/ other roles as required • Set exit-date for resources Building a formula for success
  • 24. Plan for the short-term & long-term Planning • Plan out short-term work • Plan out longer-term projects • Do they need to be rescheduled? • Prioritize work that will reduce toil & burnout (Automation + Measurement) Building a formula for success
  • 25. Communicate expectations with clients & partners Communication & Partnerships • Communicate problem statement & exit criteria • Send regular progress updates • Ensure that stakeholders understand delays & expected outcomes Building a formula for success
  • 27. Key Learnings Measure toil/ overhead Measure Prioritize efforts to remove overhead/toil Prioritize Communicate with partners & teams Communicate
  • 28. Q&A