Anything that can go wrong will go wrong! That's how Murphy's law puts it: outages are inevitable, and systems often misbehave.
As developers, we work hard to build reliable and scalable systems; it's our job to keep the ship afloat and the services up. In this session, we will talk about fires, how to put them out, and how to be ready for them. We will cover abuser stories, graceful degradation of service, and dependency management as techniques to fight fires, illustrated through war stories of how they helped, or could have helped, service owners.
Agenda:
- What brought me here?
- What is a fire?
- Firefighting vs. the dev team
- How do you end up with a fire?
- Your dependencies will fail
- Feature toggles are your friends
- Abuser stories
3. Agenda
• What to expect?
• What brought me here?
• What is a fire?
• Fire-fighting vs. the team
• Mapping out dependencies
• Feature toggles
• Abuser Stories
• The unknown unknowns!
4. What to expect?
Why fires harm more than just your SLIs
A proactive approach to fire-fighting
What could they have done better? (War stories)
5. What brought me here?
• Same process, different problem
• The scale matters
• Less fire-fighting = Higher productivity
7. What is a fire?
Anything that would require an emergency re-allocation of resources. In other words, anything that would require you to drop whatever you have in hand right now and start working on it.
8. Why is it a problem?
Disturbs the planned work
Hero culture
Stressful for fighters
Patches make it chronic
9. How does it happen?
• Cutting corners while solving problems
• Too many problems and not enough time
• Allowing others to control the project agenda
• No priorities, everything is urgent
16. War stories?
• As an abuser, I'd like to live-stream violent content (video) and make it go viral. - Facebook
• As an abuser, I want to post social engineering-based scams and make them go viral. - Twitter
17. "There are known knowns. There are things we know we know. We also know there are known unknowns. That is to say, we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know." - Donald Rumsfeld
- Same agile process with the same mistakes: skipping retro, too many meetings, missing standup, and even no proper estimation.
- Fires now are caused by live traffic and bad code rollouts, unlike before, when for me a fire was a new requirement from a potential customer.
- The new scale makes a fire actually feel like a fire: it's visible, it's big, and it's its own type of stress.
- I love and hate firefighting; it gives me the rush I need to stay excited, but if it becomes a full-time job it's exhausting, stressful, and leads to a half-baked piece of software with tons of patches.
- It takes time from the sprint and pushes timelines out.
- Watch task traffic: how many tasks are added vs. how many we burn.
- Hero culture messes up the reward/compensation model and leads to stressed employees, no work-life balance, etc.
- Rushed patches lead to other bugs that group with existing bugs, become a fire within months, and then the problem becomes chronic.
The Facebook iOS SDK took down Spotify and TikTok.
Focus on the user activity stream as a dependency. If it goes down, we can work without the tiny green light: we can simply ignore that code path or handle the failure gracefully.
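A minimal sketch of that idea in Python, assuming a hypothetical HTTP client and a `/presence/{user_id}` endpoint (both names are invented for illustration): the presence "green light" is treated as optional, so a dead dependency degrades the page instead of breaking it.

```python
def fetch_presence(user_id, client, timeout=0.2):
    """Return the presence flag, or None if the activity-stream service fails."""
    try:
        resp = client.get(f"/presence/{user_id}", timeout=timeout)
        resp.raise_for_status()
        return resp.json().get("online")
    except Exception:
        # The green dot is cosmetic: swallow the failure and render without it.
        return None


def render_user_badge(user_id, client):
    """Render the user badge with or without the presence dot."""
    online = fetch_presence(user_id, client)
    dot = "" if online is None else (" ●" if online else " ○")
    return f"user:{user_id}{dot}"
```

With the dependency down, `render_user_badge` still returns a usable badge, just without the dot.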
Refer to Martin Fowler's writing on feature toggles and the idea that, for feature releases, we should break the feature down and make the toggle the last resort, in contrast to long-term feature toggles used as a degradation-of-service technique. Even if both use the same framework to flip the switch, we still need to make that logical distinction.
Focus on stored cards as a feature that you want to disable and enable manually, as you might use that switch when you suspect a malicious attack or a technical problem like an outage in your auth service.
Looking at the story from a different perspective really helps with fleshing out the task.
It really helps if you think about it beforehand, rather than while you are already under fire pressure.
It could have helped if they had been ready for that kind of abuser.
- Facebook could have shut down the video much faster.
- Twitter could have used keywords to block a tweet from going viral (which is what they did, but it took them some time).
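The talk only says Twitter eventually did "something like this"; as a hedged sketch, a keyword-based throttle could be as simple as the following (the function name and the example phrases are invented):

```python
# Phrases an operator could add during an incident to stop a scam going viral.
BLOCKED_KEYWORDS = {"crypto giveaway", "send btc"}


def should_throttle(post_text: str) -> bool:
    """Flag a post for reduced distribution if it matches a blocked phrase."""
    text = post_text.lower()
    return any(keyword in text for keyword in BLOCKED_KEYWORDS)
```

The value is less in the matching logic than in having the hook ready before the fire, so the response is editing a keyword list, not shipping emergency code.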
Everything can happen at the same time. Without proper monitoring you are flying blind, with zero visibility into your system. Imagine seeing a drop in sales without knowing why; after a while you realize that customer-support inbound is coming only from Europe and North Africa, you start digging, and you find that a server has run out of disk space because of logs, and the code wasn't written to handle that gracefully.
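A tiny visibility check inspired by that story, assuming a hypothetical `disk_alert` helper and a 10% threshold (both are illustrative choices, not a recommendation): alert before the log partition fills up instead of finding out from a sales graph.

```python
import shutil


def disk_alert(path="/var/log", min_free_ratio=0.10):
    """Return an alert message if free space on `path` drops below the
    threshold, else None. In practice this would feed a monitoring system."""
    usage = shutil.disk_usage(path)
    free_ratio = usage.free / usage.total
    if free_ratio < min_free_ratio:
        return f"LOW DISK on {path}: {free_ratio:.1%} free"
    return None
```

Real deployments would use a proper monitoring stack, but even a cron job running a check like this turns "sales dropped mysteriously" into "disk almost full" hours earlier.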
- Timezones?
- epochs?
- MM-DD or DD-MM?
- Don't forget to hydrate
- It can be stressful and long, so try to stay calm
- "The simplest explanation is most likely the right one." - Occam's razor
- Follow your gut and look for evidence
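The first three items on the checklist above fit in a few lines of Python; the concrete date and timestamp values are arbitrary examples.

```python
from datetime import datetime, timezone

# MM-DD vs DD-MM: the same string parses to two different days.
ambiguous = "03-04-2021"
us = datetime.strptime(ambiguous, "%m-%d-%Y")  # March 4th
eu = datetime.strptime(ambiguous, "%d-%m-%Y")  # April 3rd
assert (us.month, us.day) == (3, 4)
assert (eu.month, eu.day) == (4, 3)

# Epoch seconds vs milliseconds: off by a factor of 1000.
ts = 1_600_000_000  # seconds since 1970-01-01 UTC
dt = datetime.fromtimestamp(ts, tz=timezone.utc)            # 2020-09-13
wrong = datetime.fromtimestamp(ts / 1000, tz=timezone.utc)  # 1970-01-19!

# Timezones: always keep datetimes timezone-aware during an incident.
assert dt.tzinfo is timezone.utc
```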