The document discusses how to learn from failures through effective retrospective meetings. It recommends that retrospectives include proper preparation like choosing a facilitator and building a timeline. During the meeting, the most involved engineer should provide context, customer impact should be discussed, and the discussion should focus on process improvements rather than blame. Many potential improvements or "remediations" may be identified. Both engineering and product teams should consider improvements to prevent future issues and improve customer experience. Effective retrospectives can help organizations continuously learn and improve.
48. Is your fix a small thing you can add to existing customer tools? Engineering should be able to do this with minimal product sign-off.
49. You can improve your customers’ experience. Your customers, your fellow engineers, and your community can benefit from your own needs and hard-won experience.
Hi, I’m Joy and I’m the SRE director at Heroku.
For those of you who aren’t familiar with Heroku, we’re a Platform as a Service. This means we handle a lot of the operations work for the customers who run on our platform. My job is to keep our platforms maximally stable so our customers can sleep easy at night.
I'm here to talk about failure and why I love it, or at least don’t hate it.
Why would I want to talk about failure? Failure is amazing — it can be our best teacher. As an SRE failure is utterly crucial to me doing my job. Complex systems often fail and we learn so much more from their failure than from success.
A lot of us have probably had this realization. If we didn’t have failure, we’d be out of a job.
So the question today is how do we learn from that failure? How do we learn from that failure in a way that doesn’t make us feel like failures?
Let's start with an SRE war story — everyone loves a good war story.
How many of us have ever run out of integers in an auto-incrementing primary key column in a database?
The whole database halts because it just ran out of numbers. And it’s usually a critical database.
I've seen this failure mode pretty much everywhere I've ever worked as an SRE.
It's pretty embarrassing because seriously -- you just ran out of numbers. It seems really easy to fix but it just keeps cropping up.
So what are some of the reasons that this keeps happening?
Commonly used frameworks have defaults that can come back and bite you later.
Assumptions about the size of your database before it hits production. It’s a good problem to have when you’re successful enough that you outgrew your original assumptions. Two billion — that's a lot of numbers.
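To put “two billion” in perspective, here’s a back-of-the-envelope sketch (my own arithmetic, not from the talk):

```python
# PostgreSQL's INTEGER is a signed 32-bit value, so an auto-incrementing
# primary key backed by it tops out at 2**31 - 1.
INT_MAX = 2**31 - 1        # 2,147,483,647 -- the "two billion"
BIGINT_MAX = 2**63 - 1     # signed 64-bit: about 9.2 quintillion

# How long does that headroom last at a steady insert rate?
inserts_per_second = 1_000
days_for_int = INT_MAX / inserts_per_second / 86_400
print(f"INT lasts ~{days_for_int:.0f} days at 1k inserts/s")

years_for_bigint = BIGINT_MAX / inserts_per_second / 86_400 / 365
print(f"BIGINT lasts ~{years_for_bigint / 1e6:.0f} million years at the same rate")
```

At a steady thousand inserts a second, INT buys you under a month of headroom; BIGINT effectively never runs out.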
Or just not thinking about it at all! That's probably the most common reason.
So we had this happen to us twice in two months. That was pretty bad.
Then it happened a third time almost a year later. For me as the head of SRE seeing this again was pretty painful!
We run a Platform as a Service! Our whole premise is doing operations for our customers so they don’t have to. So how do we fix this problem for real?
First we have to consider it more deeply than we did at the start. If the obvious fix was the long-term fix it wouldn't keep coming up.
It’s simple enough to fix one occurrence of this, just change it to BIGINT. Data starts flowing again, folks go back to business as usual.
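In PostgreSQL terms, that one-off fix is roughly the following (a sketch using a hypothetical `events` table; note the locking caveat in the comments):

```python
# Sketch of the one-off fix, using a hypothetical table named "events".
# ALTER ... TYPE rewrites the table under an ACCESS EXCLUSIVE lock, so on a
# large, hot table you would plan a maintenance window (or an online
# migration) rather than running it casually.
FIX_SQL = """
ALTER TABLE events ALTER COLUMN id TYPE bigint;
ALTER SEQUENCE events_id_seq AS bigint;  -- PostgreSQL 10+: widen the sequence too
"""
print(FIX_SQL.strip())
```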
When this happened the second time we applied a similar fix, and we also poked around manually at other crucial DBs that might have this problem. We even caught a few before failure that way.
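That manual poking can be approximated with a catalog query. Here’s a hedged sketch (the query and helper are my own, not Heroku’s tooling) against the `pg_sequences` view available in PostgreSQL 10+:

```python
# Hypothetical sketch: list sequences whose counters are approaching the
# 32-bit ceiling. This flags the sequence's position only; you still need to
# confirm the backing column is INTEGER rather than BIGINT.
AT_RISK_SQL = """
SELECT schemaname, sequencename, last_value
FROM pg_sequences
WHERE last_value IS NOT NULL
  AND last_value > 0.75 * 2147483647;
"""

INT_MAX = 2**31 - 1

def fraction_consumed(last_value: int, max_value: int = INT_MAX) -> float:
    """Fraction of the usable key space already handed out."""
    return last_value / max_value

# A sequence sitting at 1.9 billion is well past the danger threshold.
assert fraction_consumed(1_900_000_000) > 0.75
```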
We needed to fix this a lot more systemically. Fortunately there’s a good tool for that!
So who here is familiar with retrospectives? I imagine most people here have been to or at least know about them as a place to reflect on past projects or incidents.
One of the main things that SRE instituted at Heroku was retrospectives for all customer-affecting incidents.
If you have been to a retrospective, you probably have been to a boring retrospective. I know I’ve run boring retrospectives. Sorry.
I used to think that if you just got the right people in a room together to chat over an incident, things would naturally happen and we’d have a great, engaging conversation and leave with an amazing solution that would fix our problems. Maybe it would also solve world hunger.
In reality, when you pop a 1 hour meeting on a bunch of folks’ calendars about an outage with no context, this stuff happens:
Some folks don’t show, because they are allergic to calendars, email, and meetings.
The ones that do show might be there because they have an axe to grind, or because they feel like they have to defend themselves.
Establishing the timeline in the meeting leads to bickering and “well, actually” statements that put everyone who wasn’t in a bad mood into a bad mood.
Once everyone is sufficiently miserable, you’re most of the way through your time. You have about 5 minutes to give people some work to do as the cherry on top of the misery sundae.
If that doesn’t happen, everyone is bored and tuned out. The engineers are all doing email. The facilitator is doing email. No one’s paying any attention. At the end of the meeting you have some cursory remediation items and if you are lucky some might actually get done.
For a retrospective to be useful, it can’t be boring. A retrospective is the pivot point between failure and learning. If it’s boring, no one is learning and you might as well give everyone back the time in their day they were sitting in the meeting.
Putting a bunch of highly-paid engineers in a meeting for an hour in which they don’t learn anything is a waste of time, money, and morale.
One problem we had with the first INT rollover is that we didn’t have a retrospective, because folks thought retrospectives were a waste of time for something so trivial and easily understood. They were trying to avoid a boring, time-consuming meeting without a clear sense of what value it would have.
This makes sense. I avoid boring meetings too. In this case, the problem was deceptive. Had we dug into it the first or even the second time we would have been able to discover that.
So how do you have non-boring, useful retrospectives?
One way to create engagement during the retrospective is by preparing for the meeting. Don’t force people to watch the sausage being made.
It is excruciating for someone to attend a meeting only to watch the timeline get pieced together on the spot, or to find that the right people aren’t in the room and the wrong ones are.
Retrospectives are a big time commitment we expect people to make and we need to make them count. People should know that when they show up to a retrospective that they're actually going to get something good out of it.
The facilitator is the most crucial role in this meeting.
The facilitator should familiarize themselves with the facts of the incident -- ideally they are someone adjacent to the incident but not a primary responder, because they’re going to be talking a lot in the meeting, and they shouldn’t be asking questions of themselves. The facilitator should also know who was involved in the incident and why.
You should also build a timeline. This can be done by the facilitator while they're gathering all the facts for the retrospective. This is really important.
When I say build a timeline, I don't mean have everything down to the second of precision and every little tiny detail. It should be an overview.
Think of it as a narrative - how would you tell the story of this event? If you were telling a story, you would have a beginning, middle, and end. You’d cover salient points. And you probably wouldn’t be going for microsecond precision.
Any good engineer needs their tools. When I talk about tools, I don’t just mean stuff that you can check into a repo. I mean mental tools as well.
Here’s an overview of the tools I most commonly use to create engaging retrospectives. There’s nothing magical about any of these -- you can use them too.
I’ll take you through them.
Why chat? Audio transcriptions are error-prone and time-consuming.
We run all our incidents, and indeed our day to day communications, in chat. That means everything has a transcript that you can refer back to. People can communicate in parallel -- you don't have to worry about interrupting someone on the voice bridge, and you don’t need someone to transcribe what’s happening on a voice bridge. You can copy and paste commands as needed.
I don’t care which type of chat you use, as long as you use chat.
Bot tools include incident management tools built on top of our chat bots. One example is here, where we recorded something for the timeline of this incident.
We deploy in chat, and deploys emit chat notifications. Pages alert in chat. We also have incident-management specific tools we wrote that can create notes for building a timeline or questions to follow up on while the incident is ongoing.
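The actual Heroku tooling isn’t public, so purely as an illustrative sketch (class and command names are made up), a chat-bot timeline command might boil down to something like:

```python
from datetime import datetime, timezone

# Hypothetical sketch of an incident chat-bot command: every note dropped in
# the incident channel gets timestamped, so the retrospective timeline
# largely writes itself.
class IncidentLog:
    def __init__(self, incident_id):
        self.incident_id = incident_id
        self.entries = []

    def note(self, text):
        """Record a timeline note with a UTC timestamp."""
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.entries.append((stamp, text))

    def timeline(self):
        """Render the notes as a draft timeline for the retrospective doc."""
        return "\n".join(f"{stamp}  {text}" for stamp, text in self.entries)

log = IncidentLog("inc-1234")
log.note("primary DB refusing writes: integer sequence exhausted")
log.note("ALTER COLUMN to BIGINT running; writes queued")
print(log.timeline())
```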
This makes the gathering information process for the retrospective much easier. It’s also great for transparency and discoverability amongst our engineers.
SitReps (or situation reports) are a common pattern in incident response anywhere.
You just want a periodic summary of the situation. This isn’t what you’re telling customers -- this is what you’re telling people internally. You can use jargon, you can use acronyms, and you don’t need to polish it to the same level as customer-facing communications.
The goal is to make sure that responders have check points to guide themselves with as they work on the incident, especially as new folks come in.
These are also very helpful when you try to understand what happened after an incident -- sitreps give you milestones of what happened and when.
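There’s no one blessed format, but as a sketch (the field names here are my own invention), a sitrep helper might look like:

```python
from datetime import datetime, timezone

def format_sitrep(status, actions, next_update_min):
    """Hypothetical sketch: render a periodic internal situation report."""
    stamp = datetime.now(timezone.utc).strftime("%H:%M UTC")
    lines = [f"SITREP {stamp}", f"Status: {status}", "In progress:"]
    lines += [f"  - {a}" for a in actions]
    lines.append(f"Next update in {next_update_min} min")
    return "\n".join(lines)

print(format_sitrep(
    "writes failing on primary; root cause identified",
    ["promoting patched replica", "drafting customer notice"],
    30,
))
```

Posted on a regular cadence, each sitrep becomes one of the milestones you lean on when reconstructing the timeline later.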
People underestimate the amount of time it takes to run a good retrospective. I'm not just talking about the time that it takes in the meeting. Prior preparation generally shortens the amount of time you all have to spend in a room together.
Block out time for yourself to prepare at least one day before the retro is scheduled.
Make sure all key players (including the incident coordinator and the communications people) are available and plan on attending the meeting. If someone crucial can’t attend, either reschedule or have someone who can speak for them show up instead (such as a team member).
Make sure you have a note-taker, someone who isn’t a primary responder so they won’t have to talk and take notes simultaneously.
In general, be organized. Send out the agenda, including the timeline, the day before. Make sure the room is booked ahead of time and A/V is working.
When everyone shows up with context retrospectives can get to the interesting bits faster. Who doesn’t love dissecting a failure in a complex system?
I love doing this and I know a lot of us do, because that’s why we’re in SRE.
So everyone is in the retrospective and the timeline is done. How do we start?
We set context, we keep it short, and we don't do the litany of timeline reading. Think of telling a story.
Have the most involved engineer give a brief summary of what happened. They should stick to the facts and take less than five minutes. The goal is to make sure that everyone orients themselves to what happened.
One thing I should say is that a retrospective should happen within a week of the incident. People should still have it relatively fresh in their minds by the time you retrospect. Otherwise you’re wasting people’s time, and you’ve missed the chance to strike while the iron is hot and folks are feeling motivated to tackle remediations.
Once you are actually in the meeting you're going to want to read the room.
As a facilitator you need to make sure that everyone is engaged. You yourself need to be a very present and active part of leading the discussion. Don’t be the note-taker -- make sure someone else is the note-taker.
You'll need to ask questions of everyone, especially the quiet folks. Some people will want to dominate the conversation and some people will never want to jump in but that quiet person probably has some really good insights.
You should talk about customer impact!
We should have compassion for what our customers felt during the outage. It’s not just that you woke up at 3 AM because your database ran out of numbers -- a customer who might be running a business on your platform, maybe on the other side of the world, could have lost valuable business or important work, and we need to be aware of that disruption.
Take note of interesting questions, statements, and points of confusion. This gives you jumping off points for deeper conversations.
When we’ve established context we can start diving into these things.
Once you have some starting points to start your questioning, dive in. There are various methods you can use to formulate questions for investigation.
A lot of people like the 5 whys -- I think that it’s interesting (it was created at Toyota) and very logical for engineers to grasp, but I like more flexible methods. I really like John Allspaw’s Infinite Hows. Asking “why” can frame the conversation in a more blameful way than asking “how”.
I don’t think this needs to be prescriptive, though. Simply don’t stop asking questions until you have gotten many layers deep.
Really really important -- if you ever get to human error, keep digging. Your systems are created and operated by humans for humans. Human error is a constant.
I cannot emphasize this enough! You have to work around and with human error.
Have you ever heard the phrase “Linux is user-friendly, it's just picky about its friends”? I disagree. Linux is dangerous. Complex and powerful tools can be dangerous. If you can take out your system with a typo your systems are too fragile, because someone is going to make a typo.
If someone skips a step or makes a typo due to exhaustion or inattention, that’s not on the engineer.
Always assume good intent. Humans get tired, humans get burnt out, humans get distracted. And humans run your systems.
When we build and maintain complex systems we have to develop interfaces for them that are as tolerant as possible to human frailty. The bonus here is that we like working with systems like this. Less friction and stress over using your tools means happier engineers, and happy engineers mean better work.
Usable, beautiful tools are an investment in scaling and reliability.
A reason to be very careful about respecting human failings is that we don't want to make people feel defensive.
When someone feels that they have to defend themselves, they throw up shields.
After that point, you won’t get useful information out of that retrospective. Folks need to feel safe to disclose mistakes they have made. That's how we find out how to fix these gaps in our tools.
One way you can tell a retrospective was good is that at the end you have a ridiculous list of remediation items.
Remediations can range from big and sweeping, to tiny and tactical, to completely absurd.
Reaching the ridiculous means you made it to the end of the questioning line!
Don’t feel you have to do every remediation that comes out of a retrospective. Give yourself the freedom to think about all the options and narrow them down afterwards. Narrow down what you can commit to only after you’ve been creative.
Don’t discount big projects either! That’s the really interesting work.
This is where it helps to understand your company’s process for bringing new work into engineering.
All too often we focus on remediations we can do quickly and within one team. We should be thinking more holistically.
Product is often really excited to hear new ideas. It’s their job to think about how to improve customer experience and what new things customers want.
SREs are great at finding problems and Product is great at finding solutions.
An example of something that came out of a common need for our engineers and our customers -- Heroku Pipelines. We use this for our own internal deployment flows! A lot of Heroku runs on Heroku.
Apps in a pipeline are grouped into “review”, “development”, “staging”, and “production” stages representing different deployment steps in a continuous delivery workflow.
You don’t have to build something huge to be customer facing. A lot of time SREs think of ourselves as internet plumbers (or janitors) -- no one knows we’re there until something’s broken. That’s valuable!
It’s also gratifying to see your work in front of a customer.
Don’t limit yourself to behind the scenes work. Don’t settle for tools that are unpleasant to use. Don’t prevent yourself from bringing up ideas because it will require cross-team or cross-functional collaboration.
You can improve your customers’ experience and your own.
Back to our war story. What did we actually do to fix our INT rollover problems?
Well, we added tooling to easily detect rollover conditions and give you a heads up to fix them before your database comes to a halt.
There’s a Heroku Postgres tool called pg:diagnose, and it will now alert you when 75% and then 90% of your integer sequence is consumed.
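Those thresholds are easy to mirror in your own monitoring. A minimal sketch (my own code, not the actual pg:diagnose implementation):

```python
from typing import Optional

INT_MAX = 2**31 - 1  # ceiling for a PostgreSQL INTEGER-backed sequence

def rollover_alert(last_value: int, max_value: int = INT_MAX) -> Optional[str]:
    """Return an alert level once a sequence crosses the warning thresholds."""
    used = last_value / max_value
    if used >= 0.90:
        return "red"     # act now: the sequence is nearly exhausted
    if used >= 0.75:
        return "yellow"  # schedule the BIGINT migration
    return None

assert rollover_alert(1_000_000) is None
assert rollover_alert(int(0.80 * INT_MAX)) == "yellow"
assert rollover_alert(int(0.95 * INT_MAX)) == "red"
```

The point of the two levels is lead time: yellow gives you weeks to plan a migration, red means you act before the database halts.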
We also added process. There’s a productionization checklist that services should be going through before they hit production. We added an item to ensure sequences are in BIGINT. There’s no reason for us to use integer rather than bigint columns for sequences in Heroku Postgres.
And of course we could and will improve.
We’d like to have this check scan our production databases automatically and alert before failure. Then of course, we could give that option to our customers.
We are also sending pull requests to at least one common open source framework (yes, still looking at you, ActiveRecord) to set better defaults.
Thanks for sticking with me while I explain why I love failure.
We’re all going to fail at some point, and operating distributed systems makes the odds much higher. Failure is much easier to take when you remember that every failure is a chance to learn.
Make them count!
Some relevant links! I hope these help you.
Thank you for your time today.