HERDING CATS TO A FIREFIGHT
THE EVOLUTION OF AN ENGINEERING ON-CALL TEAM
G. CHANG
@GREYSCHALE
EURUKO 2016
THE YEAR 1 B.C.
(BEFORE CATS)
In the beginning,
there was only
darkness.
But suddenly,
out of the darkness,
there came a sound...
(pager noises)
One person was on-call.
All day.
And night.
Every day.
Every week.
Forever.
(not really, but close enough)
Why not start having a rotation?
"We don't need no stinkin' on-call rotation!"
Bullshit.
"Hi, sorry to be calling at this hour.

I'm from Yammer, I work with _____.

Can I please speak with him?"
Date: Friday, xxth of XXX, 2013
Time: 03:00 AM GMT -0800
THE YEAR 1 A.D.
(AFTER DISASTER)
How to maths?!?!
• Given:
• Given:
• Given: (1 + 15 ± 5 ) * 2
• ((4 / 1) * ((1 + 15 ± 5) * 2)) = ???
Answer:
How to acronyms?!?!
MTBF  MTTR  AAR  SLA
• MTBF: Mean Time Between Failures
• MTTR: Mean Time To Recovery
• SLA: Service Level Agreement
• AAR: After Action Review
• IR: Incident Report
• OMGWTFBBQAFK
MTBF                                | MTTR
less frequent                       | faster recovery
requires more stable systems        | needs good response training
engineers interrupted less often    | engineers gain broad knowledge
possibly more disastrous issues     | possibly more frequent issues
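To make the two metrics concrete, here is a minimal sketch of how they could be computed from a list of incidents. It assumes the common definitions (MTBF as uptime divided by number of failures, MTTR as downtime divided by number of failures); the function name, the incident format, and the sample dates are hypothetical and not taken from the talk.

```python
from datetime import datetime, timedelta


def mtbf_and_mttr(incidents, window_start, window_end):
    """Return (MTBF, MTTR) as timedeltas for one observation window.

    incidents: list of (started, resolved) datetime pairs that do not
    overlap and fall inside the window (a simplifying assumption).
    """
    if not incidents:
        raise ValueError("need at least one incident")
    # Total time spent in a failed state.
    downtime = sum((resolved - started for started, resolved in incidents),
                   timedelta())
    # Everything else in the window counts as uptime.
    uptime = (window_end - window_start) - downtime
    count = len(incidents)
    return uptime / count, downtime / count


# Hypothetical month with three incidents lasting 1, 2 and 3 hours.
sample_incidents = [
    (datetime(2016, 1, 1, 0), datetime(2016, 1, 1, 1)),
    (datetime(2016, 1, 10, 0), datetime(2016, 1, 10, 2)),
    (datetime(2016, 1, 20, 0), datetime(2016, 1, 20, 3)),
]
mtbf, mttr = mtbf_and_mttr(sample_incidents,
                           datetime(2016, 1, 1), datetime(2016, 1, 31))
print("MTBF:", mtbf, "MTTR:", mttr)
```

Optimising for MTBF means pushing the first number up; optimising for MTTR means pushing the second number down.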
• Google Docs Forms ❎ (hard to read reports)
• Yammer Notes ❎ (hard to analyse)
• JIRA ✅ (not perfect...but sort of works)
Hey, we're starting to get this! ...
Actually, not yet.
• System grows faster than we can learn about it
• Silos appear when you don't share knowledge
• Who's cleaning up this mess, anyway?
• Burnout is real
THE RENAISSANCE
(GROWING PAINS)
Do more by doing less
• Split responsibilities by stack
• Added London office for follow-the-sun coverage
• Onboard everybody to the process
• Practice, practice, practice
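Follow-the-sun coverage can be pictured as a lookup from the current time to the office that is awake. The sketch below is only an illustration: the shift boundaries are hypothetical (the talk does not give them), and "US West Coast" is an assumed label for the original office, inferred from the GMT -0800 timestamp earlier in the deck.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical hand-over hours in UTC; the real ones are not in the talk.
LONDON_SHIFT_START_UTC = 8
LONDON_SHIFT_END_UTC = 17


def office_on_call(moment):
    """Return the office responsible for pages arriving at `moment`."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    hour = moment.astimezone(timezone.utc).hour
    if LONDON_SHIFT_START_UTC <= hour < LONDON_SHIFT_END_UTC:
        return "London"
    return "US West Coast"  # assumed name for the original office


# A 03:00 page in GMT -0800 is 11:00 UTC (the date here is arbitrary).
pacific = timezone(timedelta(hours=-8))
night_page = datetime(2016, 1, 1, 3, 0, tzinfo=pacific)
print(office_on_call(night_page))
```

Under these assumed hours, a 03:00 page on the West Coast would land in London's working day instead of waking someone up.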
All hands on deck
• Keep all alerts in a configuration repo
• Managers aren't doing anything, anyway -- make them Incident Managers!
• Runbooks, runbooks everywhere (and a unified one)
• Make the initial response as simple as possible
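One benefit of keeping alerts in a configuration repo is that they can be checked automatically. The following is a rough sketch of such a check, assuming each alert is a dictionary with `name`, `owner` and `runbook` fields; that schema, the field names, and the example URL are hypothetical and not described in the talk.

```python
# Fields every alert definition is assumed to carry (hypothetical schema).
REQUIRED_FIELDS = ("name", "owner", "runbook")


def lint_alerts(alert_definitions):
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    for index, alert in enumerate(alert_definitions):
        label = alert.get("name") or f"alert #{index}"
        for field in REQUIRED_FIELDS:
            if not alert.get(field):
                problems.append(f"{label}: missing {field}")
    return problems


# Hypothetical alert definitions as they might be loaded from the repo.
alerts = [
    {"name": "api-5xx-rate", "owner": "backend",
     "runbook": "https://example.invalid/runbooks/api-5xx"},
    {"name": "queue-depth", "owner": "backend"},  # no runbook yet
]
problems = lint_alerts(alerts)
print(problems)
```

Running a check like this on every change keeps "runbooks everywhere" from drifting out of date as alerts are added.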
BACK TO THE FUTURE
(THE PRESENT)
Combined schedules
• Fewer rotations
• Team is unified, so schedules should be too
Post-mortems and retrospectives
• What? Where? Who? Why? How?
• NO blame game
Weekly hand-overs and monthly reviews
• Previous week engineers to current week engineers
• Track top alerts and resolutions (or lack of)
• Focus on the noisiest services
• Timezones are hard
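Tracking top alerts for a hand-over can be as simple as counting pages per service. This is a minimal sketch under assumed data: the record fields (`service`, `resolution`) and the service names are hypothetical, not taken from the talk.

```python
from collections import Counter


def noisiest_services(alerts, top=3):
    """Return the `top` services by alert count as (service, count) pairs."""
    counts = Counter(alert["service"] for alert in alerts)
    return counts.most_common(top)


def unresolved(alerts):
    """Return alerts that were closed without a recorded resolution."""
    return [alert for alert in alerts if alert.get("resolution") is None]


# Hypothetical week of pages handed from one on-call shift to the next.
week_alerts = [
    {"service": "search", "resolution": "restarted worker"},
    {"service": "search", "resolution": None},
    {"service": "search", "resolution": "raised threshold"},
    {"service": "api", "resolution": "rolled back deploy"},
    {"service": "api", "resolution": None},
    {"service": "feeds", "resolution": "disk cleaned"},
]
print(noisiest_services(week_alerts, top=2))
print(len(unresolved(week_alerts)), "alerts without a resolution")
```

The first list tells the incoming engineers where to focus; the second is the "or lack of" part of the hand-over.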
Bi-monthly surveys
• Summarise overall preparedness
• Make sure we're improving
• ...and that nobody is actually burned out
Fix ALL the alerts
• Noisy
• Flaky
• Real
WHERE ARE THE CATS NOW?!
The end game
• 1 alert per person per day
• Service owners are on-call for those services
• The world is full of kittens!
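The "1 alert per person per day" goal is easy to turn into a number to watch. Below is a small sketch of that arithmetic; the function names and the sample figures are made up for illustration.

```python
# Target from the slide: at most one alert per on-call person per day.
TARGET_ALERTS_PER_PERSON_PER_DAY = 1.0


def alerts_per_person_per_day(alert_count, people_on_call, days):
    """Average alert load carried by each on-call person per day."""
    if people_on_call <= 0 or days <= 0:
        raise ValueError("people_on_call and days must be positive")
    return alert_count / (people_on_call * days)


def within_target(alert_count, people_on_call, days):
    """True when the average load meets the one-per-day target."""
    rate = alerts_per_person_per_day(alert_count, people_on_call, days)
    return rate <= TARGET_ALERTS_PER_PERSON_PER_DAY


# Hypothetical week: 42 alerts shared by 4 people over 7 days.
rate = alerts_per_person_per_day(42, 4, 7)
print(rate, within_target(42, 4, 7))
```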
Isn't on-call just for Ops?
• No
• Responsibility for our code
• Pride in our code
• No pain, no gain
After all...
we are all cats being herded.
THANK YOU
G. CHANG
@GREYSCHALE
EURUKO 2016

Voxxed Days Thessaloniki 2016 - Herding cats to a firefight
