Have you ever felt you took every wrong turn possible in the process of mitigating a production incident? Did you go through a 3-hour hell during incident response and felt the incident wasn’t complex enough to justify the horrors you’ve experienced? Did it cause you to question your engineering or problem-solving skills?
Well, it’s only partially you. Our brain is wired to make decision-making simpler. In doing so, it exposes itself to biases, heuristics, and other quirks that may seem like “bad decisions” in hindsight.
In this talk, through real-life outages, we’ll project those psychological principles onto the world of production monitoring and incident management. As a responder, you’ll learn why those behavioral patterns emerge during production incidents and what can be done to limit their effect; as a manager, you’ll learn how to enable and encourage a healthy environment that better handles those patterns.
2. About me: Boris Cherkasky
➔ Backend engineer and production advocate @Riskified
➔ I 🤍 Observability
➔ Scuba diver
@cherkaskyb on Twitter / LinkedIn / Medium
3. Agenda
01 The psychology of an incident response
02 Intro to cognitive biases and heuristics
03 Biases in production
4. Riskified by the numbers (as of August 2021)
650+ Global team, nearly 50% in R&D
180+ Countries across the globe
$60B+ Online volume reviewed in 2020
50+ Publicly held companies among our clients
98%+ Client retention for the past 2 years
23. Mitigating the Curse of Knowledge
Alerts and metrics should be set by “the common” responder, mentored by the expert.
When complex alerts can’t be avoided: document, explain, train, level UP your organization.
29. Mitigating the Simulation heuristic
Set the responder on the correct path as soon as possible, with minimal friction.
Minimize the time to start triage.
33. Mitigating the confirmation bias
Show simple and standardized data.
35. Mitigating the confirmation bias
● Don’t work alone
● Draw a concrete line between the observed facts, your hypothesis, and the existing state (outcome/outage)
Hey everyone!
I wanna take you back to one of the most memorable evenings I’ve had as an on-call engineer. It was a Tuesday. I’m at home, watching some movie with a warm slice of pizza in my hand, when I get an alert from our production system.
That alert turned a warm Tuesday night into a four-hour nightmare.
At 1 AM, after everything is back to normal, I look back at the incident, and my only thought is: this could have been solved in 20 minutes.
My greatest lesson from that Tuesday was that even the most experienced responder can make wrong decisions.
Those wrong decisions, and what causes them, are the topic of this talk.
So thank you for having me,
I’m Boris Cherkasky, and I’ve been breaking, fixing, and monitoring production systems for the last four years.
I’m a backend engineer and production advocate at Riskified, and I’m generally fascinated by observability.
I love scuba diving, and I write a small tech blog; you can find it, and me, under the handle cherkaskyb on most social networks and on Medium.
This talk is a journey into how our minds work, and how it “plays tricks” on us during production incidents.
We’ll start with some anatomy of the incident response process,
Then we’ll do a short introduction to cognitive biases and heuristics,
And the majority of this talk will focus on real-life incident examples. Through them we’ll cover how cognitive biases manifest in production incidents, and how we can mitigate them effectively.
A few words about Riskified. Riskified enables top brands to fulfill their maximal e-commerce potential by leveraging AI for fraud prevention and other financial-funnel optimizations.
About 60 billion dollars’ worth of orders from all around the world go through our systems.
Let’s get back to the previously mentioned Tuesday.
The incident started with total uncertainty - I got an alert from one of our most reliable systems, one that malfunctioned only once in the last four years.
I later learned that this uncertainty is exactly what allows psychological biases to thrive and affect our decision making.
Let’s draw a hypothetical chart of certainty over time during my incident, and I’ll walk you through the decisions made in those four hours.
I first decided to go to a dashboard and not to the logs. That’s a decision, and it happens to be a good one: I see some irregularities! My certainty grows. I then get a message from a colleague who reports partial information that contradicts what I’m seeing. Our certainty plummets.
We scratch our heads for a minute and open a more specific dashboard! It puts us on the path that one of our dependencies is impaired. Our certainty grows again.
We think we know the root cause, and we DECIDE to restart an instance. It doesn’t help.
Our certainty plummets once again.
And this goes on until the incident is resolved four hours later: a ping-pong game between certainty and uncertainty.
Because of this uncertainty, at each decision point we are prone to cognitive biases that are supposed to ease our decision-making process but, in fact, might cause us to make the wrong decisions.
Let’s begin our journey by defining what heuristics and biases are.
Since I’m going to be talking about psychology, I’m first legally obligated to mention that I am not a trained psychiatrist.
When talking about decision making under uncertainty, two terms come to mind: heuristics and cognitive biases.
To simplify this talk I’ll refer to both as “biases”; the actual difference between them is not critical here.
Biases are mental patterns: shortcuts our mind takes to simplify the complex task of decision making.
A good analogy for a bias is branch prediction in a CPU. It’s a “calculated shortcut” the CPU makes: it can work, and it does work most of the time, but when it doesn’t, a bad call was made and we need to roll back to the correct state.
Let’s start with an example. A radio commercial might state:
“Boris Insurance Inc. offers a loan at an interest rate 0.5% lower than the bank’s.”
Even that one small sentence has biases in it, put there on purpose to “help” you make the complex loan decision.
In this example it’s the anchoring bias: a cognitive bias where an individual’s decisions are influenced by a particular reference point, or “anchor”.
Our mind is now anchored to the interest rate the bank is offering. Every decision we make will be anchored to this reference point, regardless of whether it’s a good reference point or not.
And as you can probably guess, it’s probably not.
But this was a commercial, What happens in production? What biases are we prone to there?
Before we dive into this
We have to understand that biases come to life at our weakest point: during an incident. There’s no way around it.
What we can do is limit the effect, or volume, of those biases by preparing for them early in our development process,
by:
Designing “bias-proof” systems
Maintaining a “bias-aware” environment with each change we deliver <PAUSE> using effective alerting and monitoring
Creating bias-reducing response procedures
By applying the measures I’m going to discuss, you can benefit from faster incident resolution and lower frustration within the response team, and maybe even get back to your pizza while it’s still warm.
The incidents you’re about to see were managed by trained professionals, do not try them at home (or work).
Each example I’m about to cover is a real-life incident where our responders got blindsided by a bias that affected their behavior.
In the first incident, one of our most critical data sources failed. The backup didn’t work well enough, and our whole business process came to a halt.
I’m sure you have a similar business process in your system too: one composed of steps, each of which can fail.
One of the responders suggested we disable the data source and run without it. <CLICK>
In the “fog of war” it sounded like a solid idea, since it would get the service back to life.
The first time we almost turned it off, 30 minutes into the incident, one of the analysts mentioned it would breach the SLA for one of our top customers.
The second time we almost turned it off, 50 minutes into the incident, it was the head of engineering who mentioned it would put downstream pressure on other systems.
The more this idea floated around the room, the more stakeholders spoke up about the impact it would cause: from accuracy and latency to legal and operational.
This. Was. Just. Frustrating. No one could make the call, while in the background the whole business was impaired.
About two hours into the incident, the idea was escalated to the Chief of Operations, who made the decision: we can’t turn it off, and we waited for the underlying issue to resolve.
To clarify, we had two options.
First: wait for the underlying cause to resolve, and have no service until then.
Second: turn off the failing service, which would cause multiple issues around accuracy, SLA, and more.
Both are bad.
But NOT EQUALLY. We just didn’t know which one was worse, and needed the Chief of Operations to decide.
Our response team was paralyzed; the decision couldn’t be made.
This is analysis paralysis. It manifested in our inability to make the call to turn off the data source.
Analysis paralysis is a psychological effect where the more knowledge we have, the harder it is to make a decision: all the alternatives and outcomes are weighed, without ever coming to a conclusion.
In our incident, the whole business was impaired, so the response team was huge, with additional stakeholders flooding the room.
It took two hours, with more than 10 people involved, to make that decision.
The response team wasn’t independent in its decision making.
So, how can we give the response team its independence back?
It’s not always possible,
but if we start at the requirements and design phase, we can define an order of importance for our system’s SLIs: a pyramid.
When the priority is explicitly defined, the SLIs at the top of the pyramid will be sacrificed to secure the SLIs at its base, and the response team is free to independently make fast decisions to mitigate degradations.
In our case, the pyramid would have stated that the most important SLI is accuracy, so we’d have known we can’t sacrifice accuracy for availability.
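To make that concrete, such a priority order can live as an explicit, shared artifact that the response team can consult without escalating. A minimal sketch in Python, with illustrative SLI names (not our actual configuration):

```python
# Hypothetical sketch: an explicit SLI priority, most important first.
# During an incident, an SLI may only be sacrificed to protect one that
# ranks above it. The SLI names here are illustrative.
SLI_PRIORITY = ["accuracy", "availability", "latency"]

def may_sacrifice(sacrificed: str, to_protect: str) -> bool:
    """True if degrading `sacrificed` in order to protect `to_protect` is allowed."""
    return SLI_PRIORITY.index(sacrificed) > SLI_PRIORITY.index(to_protect)

print(may_sacrifice("latency", "accuracy"))       # True: latency may be given up
print(may_sacrifice("accuracy", "availability"))  # False: accuracy outranks it
```

The point is not the code itself but that the decision rule is written down in advance, so no one has to weigh every stakeholder’s concern at 1 AM.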
This example shows how “bias-proofing” in the design phase can later help mitigate those effects during production incidents.
We’ve touched on what we can do in the design phase; let’s now see how we deliver features and define alerts on them.
This example is about a single alert. This one <point up>. I’ll give you a few seconds to look at it; it’s pseudo Prometheus QL.
Now, let me have your attention again. You’re now as confused as I was.
It’s 10 in the evening, all I know is that there is an “issue with the service’s latency”, and I’m in what appears to be a math lesson!
It’s been 10 years since I last saw standard deviations, and I have no clue what two standard deviations are. By a raise of hands, how many of you know what two standard deviations are? CLICK
I start this incident with Google and Wikipedia, just to understand what the alert means.
How did we get here? How did this alert find its way to production when I have no idea what to do with it?
My surprising midnight math lesson is the result of a bias called the “curse of knowledge”, and it has three main effects:
The first is the tendency to assume that knowledge one possesses is common knowledge. The author of my alert thought it’s basic knowledge that two standard deviations above the mean of a normal distribution is roughly the 98th percentile, so my incident is, in fact, a tail-latency increase.
The second is the inability to roll back to your “unknowing” state. This is why teaching is so hard, and one of the reasons getting new on-call shifters experienced and confident is complex.
The third is that predicting another person’s actions is heavily biased towards one’s own knowledge of the issue. This is why writing runbooks is hard. Runbooks are documented recipes for mitigating an incident; the curse of knowledge is why many runbooks have “missing steps” and implicit knowledge.
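The percentile knowledge the alert’s author took for granted can be checked with a few lines of Python’s standard library:

```python
from statistics import NormalDist

# What "mean + 2 standard deviations" actually means for a normally
# distributed latency metric: the percentile it corresponds to.
percentile = NormalDist().cdf(2) * 100
print(percentile)  # ~97.7, i.e. roughly a p98 tail-latency alert
```

Two standard deviations above the mean sits near the 97.7th percentile of a normal distribution, which is exactly the kind of detail a responder shouldn’t have to derive at midnight.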
How can we mitigate this? We obviously want all our responders to be experts! Knowledge is a good thing!
To mitigate the curse of knowledge, make sure your alerting and monitoring layer is built by the “average” responder; in other words, normalize the expertise level you need to the average you have.
Have your experts review the work and train the team, but avoid having that “one monitoring person” in your systems.
If complex monitoring and alerting can’t be avoided, document them thoroughly, again targeting the “average” responder.
Train, and level up your organization.
Don’t let your responders learn during incidents.
We’ve now touched on how writing alerts and monitors can be affected by biases; in the next example we’ll dive even deeper into monitoring.
The next incident manifested two biases and was one of my most painful production incidents. We’ll talk about those biases one by one.
Let’s first talk about the system at hand. We had started implementing a new generation of services in a microservice architecture, and to do so we needed configurations stored in our monolithic main database.
So we created an ETL process that exposed those configurations in a shared storage for all the relevant services to use.
One of the configurations there <PAUSE> was a highly sensitive map of which features are enabled for each of our customers.
The incident started with elevated error rates in the API layer: we were rejecting API calls, and some customers were being refused key features of our product.
The alerts originated from the API layer, and knowing the process, I started simulating what could be causing this behavior.
And I suspected a performance degradation in the shared storage.
We decided to manually re-run the ETL, and indeed, it solved the issue.
For an hour.
I don’t know if you’ve ever experienced a P1 incident that you thought you’d solved coming back to haunt you, after you’ve already notified the business and management that everything is back to normal.
It’s a really uncomfortable feeling, one that made me doubt my engineering skills.
I gathered some of my teammates and we started the investigation again. Another hour in, my teammates found the root cause: a bug in the replication step of the ETL.
My teammates found it, but not me.
For that whole hour, while they were going through the code and logs, I was digging into the shared storage, proving (mostly to myself) why it was indeed in a degraded state.
I couldn’t hide my surprise!
A BUG, in a component that had been working smoothly for more than a year, with no change in scale or anything else.
It was literally among the last things on my list of possible root causes, somewhere around cosmic radiation.
This incident was a grueling process of checking each component in this flow one by one.
The errors surfaced at the SLI that was actually degraded, but the issue was far upstream.
Why did I dig so deep into that datastore? Why couldn’t I see I was on the wrong path?
I was deeply affected by the simulation bias.
The simulation bias states that one’s judgments are biased towards information that is easily imagined or simulated mentally.
And I was simulating the datastore as the cause.
It’s important to mention that simulation is subjective: what I can simulate, others may not be able to.
This is why I wasn’t able to simulate a bug, and why our data engineers probably wouldn’t have simulated a database performance issue.
Simulation causes high friction with the production system, and in my case, it meant focusing on the wrong elements.
So, what can we do to control what our responders simulate? It sounds like a challenging task.
The problem with the response process was the high surface area between the alert and the system. The alert sat at the end of a long chain of components, and each one needed to be checked to find the issue.
Getting to the root cause took many steps, and the time spent in that process was time spent simulating wrong paths.
Firstly, we need to set the responder on the correct path as soon as possible, and we should aim to do so with minimal friction.
So alerts and monitors should be VERY specific, covering every dependency and key SLI. We’re better off with 20 simple, specific alerts than with one catch-all alert.
If the alert had been on the ETL process, chances are I’d have started by digging into its logs, rather than working my way back from the API through the shared datastore.
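As an illustration (the metric names, thresholds, and annotations here are made up, not our actual rules), narrow Prometheus-style alert rules for that chain could look like:

```yaml
# Hypothetical alerting rules: one narrow alert per component in the chain,
# so the page itself tells the responder where to start.
groups:
  - name: etl-pipeline
    rules:
      - alert: EtlReplicationLagHigh
        expr: etl_replication_lag_seconds > 300
        for: 5m
        annotations:
          summary: "ETL replication is lagging; check the ETL job first"
      - alert: SharedStoreReadErrors
        expr: rate(shared_store_read_errors_total[5m]) > 1
        for: 5m
        annotations:
          summary: "Reads from the shared config store are failing"
```

Each rule names the failing component directly, instead of a single API-level alert that forces the responder to walk the whole chain backwards.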
Secondly, we need to minimize the time to start triage.
Time spent without data, is time spent simulating.
If possible, incident insights should be pushed to the responder, instead of waiting for the responder to pull the information they need.
That means charts and logs related to the incident can be attached to it automatically (most incident management tools support such integrations to some extent).
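A hypothetical sketch of that idea: enrich the alert payload with direct links before it reaches the responder. All URLs and service names below are invented for illustration; real incident-management tools expose similar webhook or enrichment hooks.

```python
# Hypothetical sketch: push triage data to the responder by attaching
# direct links to the alert. URLs and service names are made up.
DASHBOARDS = {
    "etl-replication": "https://grafana.example.com/d/etl-replication",
    "api-gateway": "https://grafana.example.com/d/api-gateway",
}

def enrich_alert(alert: dict) -> dict:
    """Attach dashboard and log-search links for the alerting service."""
    service = alert["service"]
    alert["links"] = {
        "dashboard": DASHBOARDS.get(service, "https://grafana.example.com/"),
        "logs": f"https://logs.example.com/search?service={service}",
    }
    return alert

enriched = enrich_alert({"service": "etl-replication", "summary": "replication lag rising"})
print(enriched["links"]["logs"])  # the responder starts with data, not a search
```

The responder opens the page already holding the relevant chart and log query, rather than spending the first minutes deciding where to look.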
So much for simulation, and for reducing the friction and surface area of the alert.
We’re now ready to talk about the final bias that attacked me during this incident.
I know what you’re thinking: I’m an experienced responder, I should base my decisions on concrete data.
I’m not gonna lie, I did. I had data to support my hypothesis.
So how was I able to “prove” that the datastore was the issue when it wasn’t?
I went straight to the source: the datastore metrics.
And there it was, my smoking gun: CPU usage had increased, and available memory had decreased!
In fact,
my “dangerous” increase in CPU was only 2%.
And the memory? A drop of only 200 MB on a large instance.
The axes in the dashboards were dynamic, but I rushed into action and missed that.
This misinterpreted data was enough to convince me that the database was indeed the issue, and it sent me on a wild goose chase while my teammates were actually narrowing in on the real cause.
Why was it so easy for me, an experienced responder, with vast <PAUSE> daily mileage with observability tools, to misinterpret what I was seeing?
It comes down to the confirmation bias: we seek information that reinforces our existing positions. We come to a conclusion first, then try to find information that fits it,
ignore information that doesn’t, and interpret ambiguous information in our favor.
When you think you know what the incident is, it’s easy to find patterns that reinforce it.
In my case, a 2% increase in CPU confirmed a wrong hypothesis and made me literally useless in this incident.
Now that we know what the bias is, how can we mitigate it?
First, let’s talk about how we show our data.
Keep it simple! Show <PAUSE> simple <PAUSE> data. Complex data is easy to “manipulate” or misinterpret.
That includes sensible legends, colors, and scales: errors should be red, throughput probably green.
In my case, had the CPU usage percentage been shown on a static scale of 0 to 100, there would have been no visible change at all, let alone a spike.
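A small, illustrative calculation (the CPU numbers are hypothetical) shows how differently the same wiggle reads on the two scales:

```python
# Hypothetical numbers: the same 2-point CPU% swing fills the whole chart
# on an auto-scaled (dynamic) axis, but is barely visible on a static
# 0-100 axis.
def visual_fraction(values, axis_min, axis_max):
    """Fraction of the chart height the series' swing occupies."""
    return (max(values) - min(values)) / (axis_max - axis_min)

cpu = [50.0, 50.5, 52.0, 51.0]  # sampled CPU usage, percent
on_dynamic_axis = visual_fraction(cpu, min(cpu), max(cpu))
on_static_axis = visual_fraction(cpu, 0, 100)

print(on_dynamic_axis)  # 1.0  -> looks like a dramatic spike
print(on_static_axis)   # 0.02 -> a flat line to the naked eye
```

Same data, same change; only the axis decides whether it looks like a smoking gun.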
One more thing about simplicity: show data with as few dimensions as possible. Dimensions are complex! The same goes for multiple axes on a single chart, elaborate coloring schemes, and heatmaps.
I’m not saying NOT to use those, but be very aware when you do!
Next: standardize your data!
When the majority of dashboards look alike, looking into any dashboard during an incident will feel familiar to the responder, reducing the chance of misinterpretation.
In my case, I rarely used CloudWatch for metrics, so I wasn’t fully aware that its scales are dynamic.
That’s about data, now a bit about process:
Don’t <PAUSE> work <PAUSE> alone. Incident response is a team effort. Show meaningful data that reinforces your position to your teammates; convince the “unconfirmed” responder that you are correct.
Draw a concrete line between the observed facts, your hypothesis, and the actual production state. Chances are that any teammate of mine who had seen the CPU and memory chart would have smiled and pointed out my mistake.
Those are all the examples I have for you today.
To wrap things up, I’d like to show a short cheat sheet that can help handle some of the biases we’ve talked about:
● Keep anything production-facing simple
● Specific alerts, standardized dashboards
● Normalize production status to the “average” responder
● Prioritise SLIs (the SLI pyramid)
I write about the connection between software and psychology on my blog from time to time, and I repost those pieces on Twitter and LinkedIn, so if you found this talk interesting, be sure to check it out.
That’s all I have for you today. It’s been a pleasure;
thank you for your time!