The history of
Tanya Reilly @whereistanya
When a datacenter goes offline, a server gets overloaded, or a binary hits a crashing bug,
we usually have a contingency plan. We reduce damage, redirect traffic, page someone,
drop low-priority requests, follow documented procedures. But why do many failures still
come as a surprise? In this talk, we look at some real life analogs to preventing and
managing software failures. Fire partitions. Public safety campaigns. Smoke alarms.
Sprinkler systems. Doors that say “This is not an exit”. And fire escapes. What can we
learn from the real world about expecting failure and designing for it?
016.svg Public domain.
Slide template started as Oivia from SlidesCarnival and then drifted into something
"When we first dropped
our bags on apartment
Welcome To New York
Good morning! So, I'm a New Yorker. I'm not from the US -- I'm an immigrant -- but
one of the many things I love about New York City is that you move here, and it’s
immediately your city. The number one criterion for being a New Yorker is wanting to
be a New Yorker. It's a welcoming place. So good morning to my fellow New Yorkers,
wherever you're originally from, and, if you've travelled to be here, welcome to New
York. We're glad to have you.
I work in Site Reliability and I'm especially interested in what happens when things
fail, the contingency plans we use to recover when something breaks. And last year I
was thinking about that a lot and walking around the city and I started really noticing
that New York is *covered* in fire escapes. They’re a contingency plan too. They’re
for incident response. You don’t use them until all of your regular methods of getting
out of the building have failed.
So I started reading about fire escapes.
content warning: fire
Before I say more about that, let’s talk content. This talk is about disaster
prevention and disaster recovery in software, by looking at parallels in building fires.
This will include stories of some of the worst fires in the history of New York City.
We'll be looking at the reasons fires started, the stuff that helped them spread and
how people died. There are also some pictures of buildings on fire. Nothing lurid, but
there are pictures.
If you have raw feelings related to recent fires, this could be rough.
If you'd be more comfortable skipping this one, you should do that with my blessing.
While you're packing up, I'll even tell you what I'm going to say, so you don't miss
Fireproof buildings are more
effective than fire escapes.
Fireproof software is more
effective than incident
Where's our fire code?
Here's my thesis
● Fire escapes are a hacky bit of afterthought tacked on to the outside of a
building after the building is finished. If you're using fire escapes, it's worth
making them as good as possible, but you’ll prevent more fires if you build
● Similarly, incident response is often a hacky bit of afterthought tacked on long
after software is released. Again, great incident response can help you recover
faster than if you don’t have it but… you’ll prevent more outages if you build
● Finally, buildings have an extremely detailed fire code, but we don't really have
an extremely detailed systems engineering code for software, and I think we
Now I'm going to say the same thing but take 35 minutes.
How Much is that Doggie in the Window? https://flic.kr/p/72Lhz1 CC BY 2.0
CC BY-ND 2.0
Fire escapes were really only built in New York City for a hundred years. They weren't
common until the 1860s, and in the 1960s they stopped being allowed on new
There's some debate now about whether we should start removing them in places
where the building has been upgraded, or whether they should be preserved as part
of the city's history.
I think at least some of them should be preserved. Look how beautiful that is!
Claudia Heidelberger CC BY-ND 2.0. https://flic.kr/p/oqYYv1
And here's another lovely one. They made an effort to have it match the style of the
building, not feel like a separate thing tacked on at the end. And I think that's key.
Dan DeLuca CC BY 2.0
"fire escapes were
attached to the
Richard Plunz, a History of
Housing in New York City
But most of the time, the people adding the fire escape didn't think of it as part of the
building. As this quote says, fire escapes were haphazardly attached to the most
elaborately designed facades. The facade of the building was architecture but the
fire escape was law.
It was an external contingency plan, not part of the main structure. And I think that's
part of why fire escapes ended up not being successful.
A brief history of
New York City fires
(With apologies to actual historians)
But I'm jumping to the end. Let's look at the evolution of New York City's fire code.
By the way, my great fear now is that there’s a building historian in the room who
will listen to this and be like “Nope, that is really not what happened." Please forgive
any errors, building historian! If I made mistakes, I would love it if you would come tell
me at the end!
On to the history. We’re skipping the great fire of 1776, and jumping straight to 1835
and the Financial District.
This was a commercial, not residential area, and as a result the number of fatalities
was comparatively low -- two people -- I mean, still, two too many, but this is mostly
remembered as a fire that cost a LOT of money. Almost 700 buildings were
destroyed. The city had 26 fire insurance companies. This fire put 23 of them out of
_the_City_of_New_York_Dec_16_1835.jpg Public domain.
no failure domains
contingency plans failed
exhausted incident responders
The fire was caused by a burst gas pipe in a maze of wooden warehouses. Wood
burns easily so there were no failure domains: the fire spread very quickly. Inside two
hours it covered 17 city blocks, most of the financial district.
The city's water supplies were low and the typical contingency plan was to pull water
from the rivers, but it was a freezing night in December and first the firefighters had to
cut through ice.
At the time it was also common to use gunpowder to level buildings and stop the fire
spreading. But they had used up all their gunpowder on a fire two days earlier. That
fire involved the entire fire department of 1500 people, and they were still exhausted.
Still, they fought the fire for 15 hours until marines from the Brooklyn Navy Yard
arrived with more gunpowder and blew up some buildings along Wall Street to make
dedicated incident responders: a
professional fire department
new infrastructure: the Croton Aqueduct
better incident response
As a result of the fire, the city stopped using volunteer firefighters and moved to a
professional force with better equipment.
And they built the Croton Dam and Aqueduct. It was built because of the fire,
but a reliable water source is good for lots of reasons!
No longer in use, btw. It was replaced with the New Croton Dam, which still
supplies a small fraction of the city's water. The old one is on the National
Register of Historic Places.
robust structures: they rebuilt in
But more importantly, as well as better incident response, they took
the opportunity to make a more resilient city. The fire spread fast
because the buildings were made of wood. They rebuilt with stone and
And this paid off, ten years later, when there was another enormous fire. The
great fire of 1845 was very bad -- thirty people died -- but it didn’t spread
as far or as fast, because it slowed down when it hit those new brick buildings.
Let’s jump forward 25 years and talk about tenements. Tenements were extremely
dense, extremely terrible housing. I'd read about tenements but hadn't realised the
scale of them. In the 1860s, nearly 500 thousand people -- more than half the city --
lived in tenements.
The population of New York City doubled every decade between 1800 and 1880.
Maybe you've seen this with teams and software systems: when you grow rapidly,
you can build some culture problems and some technical debt. This was certainly the
case here. Landlords made more accommodation by splitting big rooms into many
smaller ones, mostly with no light or ventilation. These were really awful places to live.
They were crime riddled, filthy and filled with disease. Every report about them
mentioned that they were fire traps.
In 1860, two tenement fires happened back to back.
45th street fire:
Quote about the buildings from that second article:
“If a skillful man, with a deadly hatred of his race in his heart, sat down to plan a
human residence in which to entrap and destroy those who should dwell in it, it is
extremely probable that if he had seen these houses in West Forty-fifth-street he
would take them as a model.”
obsolete contingency plans
no failure domains
The first one, on Elm Street, started in a bakery on the ground floor of a large
residential building. Terrible place for a bakery, but that's where it was. The baker was
storing a lot of hay and wood shavings, and when they burned they made dense
smoke, killing some of the people who lived on the higher floors before the fire even
got up there.
The wooden stairway quickly burned away, trapping people on the top floors.
Firefighters arrived with ladders, but the ladders only went to the fourth floor and this
was a six storey building. At least 10 people died.
A month later four houses burned on West 45th Street. These houses had roof
hatches called scuttles, which should have let people escape across the roofs, but
they all were missing their ladders so nobody could get up there. Another ten people
An optimistic disaster plan is a useless
These escape plans -- the ladders and scuttles and the roof -- had worked fine for a
previous iteration of shorter NYC buildings, but they hadn't been updated for the new
shape of the city.
Just like with the water and the gunpowder, there was a plan in place for a fire
disaster. And just like them, the plan only worked in the most optimistic
We see that all the time. Backups that will work if we lose the database in a very
specific way. Failover plans that only work if we have two weeks notice of the failover
and the old data center doesn't lose power.
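One way to make a backup plan less optimistic is to rehearse the restore, not just the backup. Here is a minimal sketch of a restore drill, assuming a SQLite database and a hypothetical `orders` table; the function names and the row-count check are illustrative, not something from a real system.

```python
import sqlite3


def make_backup(source: sqlite3.Connection, backup_path: str) -> None:
    """Copy the live database into a backup file."""
    dest = sqlite3.connect(backup_path)
    try:
        source.backup(dest)
    finally:
        dest.close()


def restore_drill(backup_path: str, expected_rows: int) -> bool:
    """Restore the backup into a scratch database and check it is usable.

    The 'orders' table and the row-count check are assumptions made for
    this sketch; a real drill would check whatever matters to the service.
    """
    backup = sqlite3.connect(backup_path)
    scratch = sqlite3.connect(":memory:")
    try:
        # Actually perform the restore instead of trusting that the file exists.
        backup.backup(scratch)
        intact = scratch.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
        rows = scratch.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
        return intact and rows == expected_rows
    except sqlite3.Error:
        # A missing, empty or corrupt backup fails the drill rather than
        # being discovered during a real outage.
        return False
    finally:
        backup.close()
        scratch.close()
```

Running something like this on a schedule is the software version of checking that the roof scuttle still has its ladder.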
new law: an Act to Provide Against
Unsafe Buildings in the City of
The city immediately passed a law to make the tenements more robust against fire.
They even put an injunction on new tenement construction until the law was passed.
Now houses for more than eight families (kind of specific) had to have fireproof
stairs either inside or outside the building.
What’s frustrating about this is that four years earlier a commission had reported that,
if there was a fire, tenants on the 6th and 7th floors of tenements had basically zero
chance of survival. They recommended fireproof stairs. But nothing happened until
a bunch of people died.
have fire escapes...
Seven years later, the Draft Riots (which are a whole separate awful thing in which a
whole bunch of people died) led to another law: the Tenement House Act. This act
had good goals but it was extremely unsuccessful.
Buildings had to have a fire escape, but they didn't have to make anyone safer! So
landlords put up fire escapes that couldn’t hold the number of people in the house, or
that weren’t well attached to the walls or that were just a rusty ladder. And what even
was a fire escape? Well, it wasn't well defined.
Let's take a diversion and look at some fire escape patents.
As we look at them, you might want to think of disaster recovery plans you have
known and loved.
The picture’s actually from 1900 but whatever :-D
et.4a18586.jpg Public domain.
This is a ladder with a counterweight. Imagine climbing down from the 7th floor of
your building on one of these. With your six children. In the rain. In a dress that went
to your ankles.
877.jpg Public domain.
This is a kind of rope ladder that attaches to a window sill.
This is a parachute that rolls up very small. The idea was that you'd carry it with you
everywhere in case you were in any tall building fire situations.
According to this patent, and I quote: "A person desiring to escape seizes one
member of the cord, rope, or chain, as shown in Fig. 1, and forthwith jumps out of the
Like, I am looking at this thing and do not feel like I could forthwith jump out of
Anna Gonnelly's fire escape was a bridge that you could sling from your roof to
another building. It had side rails, so it was only moderately terrifying.
This one is just fantastically ludicrous. But good if you want to fight supervillain crime?
All of these patents were granted, btw.
GOOGLE PATENT US 912152 A
And this one… You might think that this is just a parachute helmet. It is not. It is a
parachute helmet and a pair of very bouncy shoes.
GOOGLE PATENT US 221855 A
Finally, I've read this patent three times and I'm fairly convinced that the guy invented
a rope. It's the most Silicon Valley invention of 1882.
Though, let's be clear, rope was a popular kind of fire escape. In fact, it was the state
of the art for hotels.
Puck Magazine, 1887
I don't mean a ladder made of rope, I mean literally a rope. Every hotel room had to
have a rope and that was the only fire escape. Even at the time, people found that
This is part of a snarky cartoon from a magazine called Puck, published in 1887, of a
whole lot of people trying to use the ropes.
Like most of those other patents, it's designed for the easiest case: someone
with upper body strength and agility who isn't wearing a skirt or carrying a
child. If your disaster plan only works for the easiest case, it's not a good plan.
I want to emphasise here that a rope is better than nothing. In fact, probably every
one of these fire escapes, even mister parachute hat, is better than nothing. But these
escape plans are not where I would put my efforts if I wanted to have fewer people
die in fires. But this is what the law focused on.
Pre 1923 so public domain
Tenements must also
Anyway! The Tenement House Act.
Even with fire escapes, tenements were still terrible. They were badly constructed,
overcrowded, and -- I find this amazing -- it was perfectly legal to store lots of
combustible materials in them.
One other thing the Tenement House Act said was that every room now had to have a
window. And just like “what even is a fire escape” it didn’t define “what even is a
window”. So the landlords cut holes in interior walls between rooms and called them
A decade later, the law said sigh, ok, exterior windows. So landlords started
constructing buildings with air shafts, little narrow gaps between buildings. Now,
picture it, you have no indoor plumbing and the bathroom is down six flights of stairs
and now you have an air shaft. You can imagine how that goes. One article I read
described the air shaft as “festering tubes of disease”. Very poetic!
And many of the fire escapes just led down to these air shafts and there was no way
out from there.
g Public domain.
Carla Geisser CC BY THANK YOU CARLA <3
By 1871, iron fire escapes were becoming common and of course people were using
them as extra space. You still see that now -- they're used for bikes and gardening
and barbecues and cat runs. All of that has been illegal since 1871. Because it makes
the fire escape very hard to use in a fire!
A later law said that every fire escape had to have a cast-iron sign saying that you
could be fined for obstructing your fire escape. And it was fair, because usable fire
escapes are better than unusable ones.
But, again, it was still perfectly legal to run your explosive business out of a tenement
basement and tons of residential fires started because of deep frying crullers. And
anyway, the regulations were mostly not enforced, so people didn't pay much
The encumbrance sign thing is from 1885, but encumbrances were illegal from 1871
and mentioning this many dates makes *my* ears glaze over and I'm already
interested in this. So we're conflating two things to keep it moving along.
Image by Carla Geisser, used with permission.
In 1876, the Brooklyn Theater on Cadman Plaza.
The final act of the play was about to start and the stage manager noticed a very tiny
fire on the left of the stage.
m_Johnson_Street_Looking_East.jpg Public domain.
obsolete contingency plans
unpracticed incident response
It was typical to keep buckets of water next to the stage, but there weren't any. There
was a fire hose, but too much scenery was piled beside the stage and he couldn't get
to it. There's those encumbrances again.
The stage manager asked a couple of carpenters to put the fire out by beating it with
poles. This didn't work and actually spread some sparks, setting fire to the loft.
The actors -- laudably -- wanted to avoid a panic, so they announced that the fire was
part of the show, and that people shouldn't freak out, but once the audience realised,
they stampeded. And they had trouble getting out. We have a real stampeding herd
problem here: there was only one stairway down from the cheap seats at the top, and
everyone was trying to use it at once. It filled with smoke. There were no fire escapes and
some exits were locked to keep out gate crashers so people couldn't get out
278 people died. At the time, it was the worst theater fire in US history. It's now the
third worst because we really don't learn.
new laws for exits and encumbrances
automated response: sprinklers!
The jury blamed the theater owners for not obeying a bunch of existing fire laws, and
new laws were written, including widening exits and not storing stuff on the stage. In
1882, the building code said that theatres had to have automatic sprinklers: it's the
first type of building in the city to require sprinklers. The first automated response.
What I find remarkable is that this fire happened nine years after regulation said that
tenements had to have safe exits, but those laws didn't carry over to theatres, or to
other types of buildings like: hotels, schools, factories, ships, offices. I'm going to
spare you most of the horror stories, but we'll look at factories in a minute, after….
...we get proper no-kidding tenement regulation at last! And we even do it without a
bunch of people dying! Thank you Jacob Riis!
In 1890, this guy called Jacob Riis published a book about tenement life called How
the Other Half Lives and did a lecture tour on it. And up until now the upper and
middle class people of New York City had sort of known the tenements were awful,
but for the first time ever, there were photographs. It was harder to ignore. Well, it was
probably part empathy, part fear of smallpox coming out of there but, whatever, over
the next decade, people started to care.
I was really reassured when I read this, because until then it had been all “there was a
horrific fire and we added a very specific law and then there was a different horrific
fire and we added a different very specific law”. And it was mostly like that! But this
Tenement House Act came from someone saying “wow, look how much this sucks” in
a compelling way. And that gives me hope!
Anyway, the next couple of Tenement House Acts included having to have actual
windows, not air shafts, and fire escapes couldn't be ladders any more: they had to
have open balconies and stairs and be properly attached to the wall. Even better:
your neighbours can no longer boil oil in the basement! Hurray! And all new
construction has to have interior fire partitions. Failure domains!
We're finally looking at stopping fires from starting and spreading, not just escaping
from them. And, best of all, it’s all actually going to be enforced. Welcome to the 20th
But, oh yeah, it still sucks in factories.
:How_the_Other_Half_Lives_front_cover.png Public domain.
Public domain because pre 1923.
https://commons.wikimedia.org/wiki/File:Jacob_Riis_portrait.jpg Public domain
because pre 1923.
The Triangle Shirtwaist fire is the famous one, but the Newark factory fire a few months
earlier is a textbook disaster waiting to happen so I wanted to talk about it.
This building had two fire escapes -- look at the size of this building! One of them was
a really heavy ladder that needed to be lifted into place. Another emergency plan that
only worked for people with good upper body strength. In the fire, the young women
who worked in this factory weren't able to lift down the ladder. So.. only one fire
untested contingency plans
The building was shared by a couple of paper box companies, a nightgown factory
and a lamp manufacturer. It had previously been used by machine companies and the
floors were soaked in oil.
A fire started in the lamp factory. There was no fire alarm, and the bottom three floors
had evacuated before they realised that 116 people up on the 4th didn't know there
was a fire.
This building had had ten fires in ten years and the buildings department had
condemned this factory three times, but the factory owners basically ignored them
and kept running. All of that was expensive for insurance and they didn't want another
fire on their record, so they delayed calling in the firefighters, even though the
firehouse was just across the street.
The firehouse had a policy of reprimanding their firefighters for false alarms -- no
blameless post-mortems here! -- so before raising a general alarm, they sent a
couple of guys over with a fire extinguisher, delaying the real response even more.
The only door up to the 4th floor was kept locked, which was against the law. The
windows wouldn't open and the victims had to break glass with their hands. The
window sills were four feet off the ground and the platform up to them broke under the
weight of people trying to get out.
And the victims had never been in a fire drill and they had no idea what to do. They,
quite reasonably, freaked out.
25 people died, 32 more were very badly injured.
I feel like I could spend an hour just talking about this fire. There's so much to learn
http://www.oldnewark.com/histories/factoryfirearticle.php is really good and I
recommend it, if you don't mind being angry)
Human error is never the root cause
When officials investigated, they said the root cause was not the walls soaked in
grease, or delaying calling fire fighters, or the locked door, or the lack of smoke
alarms or the unusable fire escapes. It was that "the victims merely succumbed to
The way humans react to a disaster can definitely make the situation worse --
remember those carpenters with sticks in the theater -- but that is in no way their fault.
Humans will act in human ways. If your systems can't handle that, and you haven't
invested a lot of time in training the humans to act in some other way, your systems
---State Farm CC BY 2.0
“They died from misadventure and
So what happened? Nothing. The jury didn't convict, though at least one juror later
said he regretted it. New Yorkers did look a bit at their factories and say "huh, I
wonder if we should care about that"..., but nothing changed. Is it because it
happened ten miles away instead of on the island of Manhattan? No idea. The New
York Fire Chief said "This city may have a fire as deadly as the one in Newark at any
Four months later…
"They died from misadventure and accident" from
"This city may have a fire as deadly as the one in Newark at any time." from
146 people died inside 18 minutes. The famous Triangle Shirtwaist Fire.
riangle_Shirtwaist_Factory_fire_on_March_25_-_1911.jpg Public domain.
obsolete contingency plans
This building was considered fireproof. They had done it right. They built a good
building. But it was packed with garments hanging so tightly together that the building
might as well have been made out of cloth.
The building should have had three fire escapes; it had one and that collapsed under
the weight of people escaping. Fire fighters came but the fire ladders and the water
could only get to the 6th floor and the city had gotten taller again: the factory was on
the 7th to 9th.
One exit was locked; the guy with the key escaped without unlocking it.
And the employers already knew about the problems. Employees had organised a
strike the previous year to protest the working conditions, and they'd been fired. The
building had had a recent warning notice from the department of sanitary control, but
they hadn't fixed their violations.
better tools: stronger pump, longer
better incident response
The fire department developed a stronger water pump and a longer ladder, so
they could reach taller buildings.
laws: 60 in three years
automated response: sprinklers
accountability: the American Society of
But more importantly, building conditions took a big step forwards. There were 60
new laws over the next three years. Again, everyone knew factories were bad. But,
again, the law didn't change until a bunch of people died ON THE ISLAND OF
Sprinklers started to be required in factories. (But only factories over seven stories
tall. Very specific again.)
A professional organisation, the American Society of Safety Engineers (which still
exists), was founded.
After the fire, the owners of Triangle Shirtwaist factory, Harris and Blanck, were
brought to court on charges of manslaughter but were eventually acquitted. They
were fined $75 for each life lost. However their insurance policy paid them a total of
$60,000, at the rate of $400 per life lost, so they actually profited from the tragedy.
After two years, they continued to lock the doors to exits and were fined for several
safety code violations. The worst people :-(
"...a type of exit condemned by
the experience of many fires"
NFPA report, 1914
And at last, people started to look at fire escapes differently. After the disaster, a
report called them "a pitiful delusion." and "a type of exit condemned by the
experience of many fires".
Barbara L Hanson CC BY 2.0
Dan DeLuca CC BY 2.0
Eden, Janine and Jim CC BY 2.0
don toye CC-BY-ND 2.0
Kristine Paulus CC-BY-ND 2.0
"...a type of exit condemned by
the experience of many fires"
NFPA report, 1914
The report called out a lot of reasons fire escapes are terrible:
● the platforms are too small
● people put stuff on them
● they don't get a lot of maintenance
● snow and ice makes them slippy and dangerous
But most importantly
● they never, ever get tested.
Kristine Paulus CC BY 2.0. https://flic.kr/p/fszEDf (plants)
Dan DeLuca CC BY 2.0. https://flic.kr/p/5hsnTM (chairs)
Eden, Janine and Jim. CC BY 2.0. https://flic.kr/p/7G1tWZ (snow)
Barbara L. Hanson. CC BY 2.0. https://flic.kr/p/8uxpcf (rain)
Don toye, CC BY 2.0 https://flic.kr/p/9XrAs (bike)
“ ... fire escape collapses during
times of intense use – such as
during actual fires.
John W. Cramer, The Story
of a Tenement House
Fire escapes were known to collapse during times of intense use. But they
pretty much have one time of intense use. If they're going to collapse, it's
going to be during a fire.
So what do we do?
We have a couple of options here. We can add more regulations around fire escapes:
you have to maintain them, you have to try them out every year! There actually was a
law about regularly painting your fire escape. To prevent slipping you have to
build a textured floor into the fire escape and leave a pair of shoes with good grips on
the top of each one… Or we could step back and ask whether we're optimising for the
A photo called "Fire Escape Collapse" received a Pulitzer in 1976. It's fairly
harrowing, so I'm not linking it here -- extreme content warning if you decide
to go look at it -- but it made Boston rewrite its fire escape safety laws.
Journalists are amazing.
New York Times, February 25th, 1923
In 1923, the New York Times had an article praising fireproof interior walls: "For six
years there has been no loss of life by fire in the 200 buildings so treated."
It blows my mind that a group of 206 buildings having no fire deaths in six years was
In 1929 those fireproof walls became code: all new buildings over 75 feet in height
had to have them, and also had to have two fully enclosed staircases! Failure
domains are part of the code at last!
shall not be
permitted on new
John VanderHaagen CC BY 2.0
The idea of building better buildings gained traction and in 1968 fire escapes stopped
being allowed at all. The code still says "Fire escapes shall not be permitted on new
The 1968 code also required sprinklers for hotels and high-rise office buildings, but
not nightclubs or residential buildings.
" Fire escapes shall not be permitted on new construction, with the exception of group
homes. Fire escapes may be used as exits on buildings existing on December sixth,
nineteen hundred sixty-eight when such buildings are altered, subject to the approval
of the commissioner, or as provided in subdivision (b) hereof. "
More fires. More very specific laws.
1975 - 2018
● In 1975, seven people died in a nightclub, so sprinklers became required for
● In 1998 there were two bad residential fires, and now you have to have
sprinklers for residences with four or more units.
● And I'm sure this story is not over and the code will be expanded many more
times in response to very specific things in which a bunch of people die.
Btw, there's no retrofitting of existing buildings. The laws only apply to new buildings
and existing buildings get better as they're renovated. So buildings in NYC comply with
the safety standard of whenever they were renovated last. Think about that, wherever
you sleep tonight.
So that was 150 years of fire codes. For decades we considered it inevitable that
fires would start and spread, and we optimised for escaping from them. And we
definitely got good at responding to massive fire disasters. But slowly we made
progress on other, more important parts of the fire life cycle. Which I'm going to
describe in four stages:
making it harder for the fire to start
We prevented sparks. A certain amount of sparks are ok! We need to cook food and
have birthday candles. But by becoming more deliberate about when we make
sparks, we made it harder for the fire to start at all. We moved bakeries out of
residential buildings, began doing wiring inspections, did public safety campaigns
about cooking and smoking.
stopping it while it's small
We worked on detection and immediate amateur response: smoke alarms, fire
blankets, fire extinguishers, and more public safety campaigns. And we introduced
preventing it from spreading
3. We introduced failure domains, to keep the fire to one small part of the building or
city. We started using materials that were hard to ignite so the fire would spread
slowly. And we did fire drills, to move humans quickly and safely away from the
danger area and to prevent the kind of panic that makes things worse.
okay, we're fighting a fire
And only then, 4, emergency response. We also got better at responding to massive
fires. The New York Fire Department is *very good*.
But step 4, this is our last resort and we should try not to rely on our last resort. We
gained more from stopping the fire from getting to this point.
And, if you missed my extremely subtle metaphor here, it's the same for
Image: skeeze. CC0. https://pixabay.com/en/firefighters-training-live-fire-696167/
reliability is everyone's job
The most important reliability work is making problems stop before they get to that
This means that reliability is everyone's problem. Everyone who's writing code or
designing systems should have reliability in mind.
Yeah, some people have a site reliability team. Just as we have people who specialise in
UI or security, both of which we should all care about, we can have people who specialise
in reliability and advocate for it. But, while SREs may occasionally act as firefighters, the
more important part of their job is to be the fire safety engineers, handing out smoke
alarms, legislating fire partitions, pointing out buildings that are made of wood,
advocating for the removal of clutter, educating everyone.
The part of their job which is being last resort firefighters? That skillset should be used
rarely. You don't want the FDNY running into your kitchen every time you burn toast. If
you're calling them in, it's a sign that something's gone horribly wrong. But it's still very
common to have firefighters reacting to every software problem.
There's a really nice tradition in the ops and SRE communities, where if a site is
down, people send #hugops on Twitter to the people working on it. I want to
particularly call out Baron Schwartz sending hugops in advance to people running
mail servers on GDPR day :-D
I love #hugops. I send #hugops. But one thing you'll notice if you follow the hashtag is
that… a lot of things break and nobody is really surprised.
We're at the stage of software evolution where we expect software to fail. We need
to build better buildings in software too.
And that means we think about those same four stages.
Tweets used with permission.
making it harder for the fire to start
Just like with buildings, a certain amount of sparks are fine for us too! We need to
make changes. Maybe something gets overloaded or a user does something we
didn't plan for. Many of us use the concept of error budgets: depending on how close
we are to missing our SLAs, we make more or fewer changes.
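To make the error budget idea concrete, here is a minimal Python sketch. It's an illustration, not anyone's production policy: the function names, the 25% threshold and the three release modes are all hypothetical, and it assumes reliability is measured as the fraction of successful requests against a target.

```python
def error_budget_remaining(target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    `target` is the promised success ratio, e.g. 0.999 for "three nines".
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1.0 - target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def release_policy(remaining: float) -> str:
    """Map the remaining budget to how freely we make changes.

    The thresholds are assumptions chosen for this example.
    """
    if remaining <= 0.0:
        return "freeze"  # budget spent: stop making sparks
    if remaining < 0.25:
        return "slow"    # close to the limit: fewer, smaller changes
    return "normal"      # plenty of budget: keep shipping
```

With a 99.9% target over a million requests, the budget is about 1,000 failed requests. 250 failures leaves roughly three quarters of it, so changes carry on as normal; 1,200 failures overspends it and the policy says freeze.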
We can reduce our sparks:
hiding the matches
Michael Chen CC BY 2.0
We can think about how users use our tools and provide clean, safe, validated
interfaces that are hard to get wrong. We can restrict their access to functionality or
data they don't need. A stove igniter is a better tool than a box of matches.
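As one hedged illustration of an interface that's hard to get wrong, this Python sketch wraps a risky operation (resizing a pool of servers) in validation. `plan_resize`, the hard cap and the 25% step limit are hypothetical names and numbers made up for the example, not taken from a real tool.

```python
MAX_STEP_FRACTION = 0.25  # assumption: at most a 25% change per request


def plan_resize(current: int, requested: int, hard_max: int = 1000) -> int:
    """Validate a requested pool size and return it if it is safe to apply."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(requested, bool) or not isinstance(requested, int):
        raise TypeError("requested size must be an integer")
    if requested < 1 or requested > hard_max:
        raise ValueError(f"requested size must be between 1 and {hard_max}")
    largest_step = max(1, int(current * MAX_STEP_FRACTION))
    if abs(requested - current) > largest_step:
        raise ValueError(
            f"change of {abs(requested - current)} exceeds the allowed step of {largest_step}"
        )
    return requested
```

The point of the design is that the dangerous requests (scale to zero, double the fleet in one go) fail loudly before anything is touched, which is the stove igniter rather than the box of matches.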
https://flic.kr/p/LdPYz Michael Chen CC BY 2.0
State Farm CC BY 2.0
We can make it a standard to inspect our systems, looking for regressions, looking for
what has bitrotted or become overloaded. A thorough test suite is like a wiring
inspection that runs on every deploy.
And we can do chaos engineering: continually testing the system's resilience against
https://flic.kr/p/duWtgw State Farm CC BY 2.0
stopping it while it's small
But, ok, sometimes, inevitably, things go wrong. We have an opportunity to put this
fire out while it's tiny.
https://commons.wikimedia.org/wiki/File:Fire-blanket-on-display.jpg Public domain
topquark22 CC BY 2.0
Humans can react quickest if the right fire extinguishers are available. Provide a
one-click rollback for all your changes. Use canaries: push the change to one
instance before pushing it to all the instances. And launch with feature flags to push out
new features in a way that makes it very fast to turn them off if you need to.
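Here's a rough Python sketch of two of those extinguishers: a canary rollout that rolls back on its own, and a feature flag with a kill switch. Everything named here (`staged_rollout`, `FeatureFlags`, and the `deploy`, `is_healthy` and `rollback` callbacks) is hypothetical; a real deploy system would supply its own versions.

```python
import hashlib
from typing import Callable, Dict, Sequence


def staged_rollout(
    instances: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy to one canary first; only continue if the canary stays healthy."""
    if not instances:
        raise ValueError("need at least one instance")
    canary, rest = instances[0], instances[1:]
    deploy(canary)
    if not is_healthy(canary):
        rollback(canary)  # the fire is still one instance wide
        return False
    for instance in rest:
        deploy(instance)
    return True


class FeatureFlags:
    """Percentage rollout per flag, with a kill switch."""

    def __init__(self) -> None:
        self._rollout: Dict[str, int] = {}  # flag name -> percent of users (0..100)

    def set_rollout(self, flag: str, percent: int) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._rollout[flag] = percent

    def kill(self, flag: str) -> None:
        """Turn the feature off for everyone without a new deploy."""
        self._rollout[flag] = 0

    def is_enabled(self, flag: str, user_id: str) -> bool:
        percent = self._rollout.get(flag, 0)  # unknown flags default to off
        # Hash flag and user together so each user gets a stable bucket per flag.
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:2], "big") % 100
        return bucket < percent
```

In both cases the undo path is decided before the change goes out, so the human reaching for the extinguisher doesn't have to invent it mid-incident.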
Alerts need a fine balance, as everyone knows who’s ever had an over-enthusiastic
smoke alarm in their kitchen. An occasional false alarm is ok, but having humans
continuously react to small problems can burn them out. It's using up your gunpowder
on small fires and not having enough left for the big ones! So aim to keep your false
https://flic.kr/p/6AcBru topquark22 CC BY-2.0
https://pixabay.com/en/fire-extinguisher-fire-delete-99915/ Public domain.
HomeSpot HQ CC BY 2.0
But even better, don't get humans involved at all for small things. Add automatic
recovery. If a machine dies, it should automatically be replaced. If a backend goes
missing, we should be able to coast for a while. Health checking and load balancing
should move traffic from an unhealthy region to a healthy one.
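A minimal Python sketch of that last sentence, assuming a probe that reports success or failure per region. The class name, the region names and the "three failed probes in a row" threshold are all made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RegionState:
    consecutive_failures: int = 0


class HealthAwareRouter:
    """Send traffic to the preferred region unless its health checks are failing."""

    def __init__(self, regions: List[str], unhealthy_after: int = 3) -> None:
        self._order = list(regions)
        self._state: Dict[str, RegionState] = {name: RegionState() for name in regions}
        self._unhealthy_after = unhealthy_after  # assumption: 3 failed probes in a row

    def record_probe(self, region: str, ok: bool) -> None:
        state = self._state[region]
        # One good probe resets the count; failures accumulate.
        state.consecutive_failures = 0 if ok else state.consecutive_failures + 1

    def is_healthy(self, region: str) -> bool:
        return self._state[region].consecutive_failures < self._unhealthy_after

    def route(self, preferred: str) -> Optional[str]:
        """Return the region to use, or None if nothing is healthy."""
        if self.is_healthy(preferred):
            return preferred
        for region in self._order:
            if self.is_healthy(region):
                return region
        return None
```

Nobody gets paged to move the traffic. The router drains the unhealthy region after enough failed probes and sends traffic back once a probe succeeds again.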
Maybe you want to let humans know, but the message they should get is "everything
is under control but you might want to look at this when you get a chance". Not
"WELCOME TO 3AM! A MACHINE DID A THING".
https://flic.kr/p/fmr7a7 HomeSpot HQ www.homespothq.com
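The "let humans know, but don't wake them up" idea from this slide can be sketched in a few lines of Python. The `Event` fields and the two channel names are hypothetical; the only rule encoded is that a page is reserved for problems that affect users and that automation has not already fixed.

```python
from dataclasses import dataclass


@dataclass
class Event:
    summary: str
    users_affected: bool
    auto_recovered: bool


def notification_channel(event: Event) -> str:
    """Decide whether an event interrupts a human or waits for working hours."""
    if event.users_affected and not event.auto_recovered:
        return "page"    # a human is needed right now
    return "ticket"      # under control: look at it when you get a chance
```

A replaced machine becomes a ticket to read over coffee, not a 3am wake-up.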
preventing it from spreading
Stage 3: Ok, there's a fire, it's happening. Now we want to not let it get on anything it's
not already on.
Achim Hering CC BY 3.0
Failure domains split our systems up so that only one part of it should be affected by
any given outage. And if the problem's going to move as components get overloaded,
we want that to be slow enough that we can control it, not an immediate cascade. And
we have our own version of moving bakeries out of residential buildings: we can
isolate risky customers on their own replicas or shards.
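As a sketch of what isolating a risky customer might look like, here's a small Python shard picker. The shard names, the customer names and the `assign_shard` function are invented for the example; it assumes customers are identified by a string ID and that the risky ones are listed by hand.

```python
import hashlib
from typing import Dict, Sequence


def assign_shard(customer_id: str, shared_shards: Sequence[str], dedicated: Dict[str, str]) -> str:
    """Pick a shard for a customer, keeping risky customers in their own failure domain."""
    if customer_id in dedicated:
        # The bakery gets its own building.
        return dedicated[customer_id]
    if not shared_shards:
        raise ValueError("need at least one shared shard")
    # A stable hash keeps each customer on the same shard from call to call.
    digest = hashlib.sha256(customer_id.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(shared_shards)
    return shared_shards[index]
```

If the risky customer overloads its shard, the damage stops at that shard's walls, because no other customer is ever assigned to it.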
State Farm CC BY 2.0
Just like we make it incredibly common to hear a smoke alarm and find our way
outside, make it so that a disaster is never a surprise. Humans will panic the first time
they hit a situation that's outside their comfort zone. At intervals, tell people you're
doing a controlled outage, and take a system offline.
You know the phenomenon where you're fixing something and you hit a bunch of
unintuitive commands, or out of date documentation, and it ends up taking you much
longer to do something simple? Or you even end up breaking something else? These
traps are a basement full of straw, or a fire hose with cluttered scenery on top of it. It's
making it very, very hard for you to move around safely as you try to fix the real
problem. Push back on technical debt and clutter.
Fatigue is an encumbrance too. You're way more likely to make a mistake if you're
exhausted. Set rules about how long a person should deal with an incident before
their on call shift is over and someone else needs to swap in. Enforce those rules.
photo by me.
okay, we're fighting a fire
And sometimes we will still get to stage 4, fighting a massive outage. But we should
aim to not get here often. Firefighting is not good for your SLAs and it's also not great
for the health of the humans involved.
Image: skeeze. CC0. https://pixabay.com/en/firefighters-training-live-fire-696167/
Jereme Rauckman CC BY 2.0
Ideally we'll get to a point where our firefighters mostly train using controlled outages,
like many real fire departments do. But we're not there yet.
Many of us are still fixing unreliable software by focusing on this fourth stage, with
human response and escape routes...
...and that means we're building tenements. Foul air is coming in through the air
shafts, and it's not somewhere humans should live. Reliability can't be added after
the building is finished. It needs to be built in. Failure needs to be built in.
Building better buildings makes a huge difference.
ARA_-_535469.jpg Public domain.
Well, this helped. This is the New York City fire code. It has 444 pages and costs
$140, which I know because I really wanted to bring one in here today and
dramatically wave it at everyone. The guy at the library was really confused about why
I'd want a physical copy. He was like "Look, do you have access to the internet?"
And fire safety is also mentioned plenty in the city building code, the city construction
code, the state building code, the National Fire Protection Association electrical code and
I’m sure plenty of other dense legislation. Don’t ask me what's in each of these.
There’s a lot of code, that’s all I’m saying.
But we don't have a fire code for software. We have a bunch of O’Reilly books and
they're great. But nothing makes us adhere to our best practices, or prioritises one set
of rules over the others. Why don't we have a fire code yet?
"No computer software failure has killed or injured a large number of people. It is just
conceivable that such a tragedy could occur."
Software: A Vital Key to UK Competitiveness
(C) Crown Copyright 1986
via Risks Digest (https://catless.ncl.ac.uk/Risks)
h/t Joe Thompson @caffeinepresent
It has been proposed from time to time!
I found this report from 1986 called "Software: A Vital Key to UK Competitiveness",
which had a whole appendix on safety critical software. It starts with “No computer
software failure has killed or injured a large number of people. It is just
conceivable that such a tragedy could occur.”
"Each life-critical system
must be operated by a
Engineer who is named as
responsible for the
Proposal from the UK
Advisory Council for
Applied Research and
The Advisory Council predicted a time when it wouldn’t be possible to recover from
software failure by just switching off the computer and doing the thing manually -- this
was written in 1986, remember. We're there now. They wanted certification: you
would only be able to operate a life-critical computer system if you had a license and
a Certified Software Engineer to sign off on it -- and they would be personally
liable! -- and a bunch of other stuff, and you'd have to get re-certified every five
They also proposed what’s basically on call shifts, disaster recovery practice drills,
and post-mortems, including post-mortems for near misses. A lot of this feels
prescient and we ended up doing it, but we never required certification.
slide from @jkuroda's
amazing LISA 2017 keynote.
Used with permission.
If you were at LISA in November, you might have seen Jon Kuroda's fantastic closing
keynote about aviation safety. Like buildings, plane travel got safer only after a lot of
Jon pointed out that, while we might think of computing as a new field, it's the same
age as a bunch of others. Software, aviation, power, emergency medicine all took a
big jump forward after World War 2. But our industry is significantly less mature than
any of the others.
Image by me.
The stakes are lower?
Is that because the stakes are lower? It's at least part of the reason. Mostly, the
stakes have been lower. Software mostly hasn't had the ability to cause
Researching this talk, I read a ton about deaths from software -- it really was a
cheerful time creating this talk -- and found surprisingly few. Most of the news about
software and deaths was about how software is IMPROVING things. By making
processes repeatable and precise, we're saving lives.
But we have had some famously dangerous software bugs.
The stakes are lower?
Ars Technica, August 2013
The Independent, October 1992
New York Times, June 1986
The Therac-25 radiation therapy machine had a concurrent programming bug that
made it occasionally give its patients radiation doses that were hundreds of times
greater than they should have been. Three people died.
In college I remember studying the London Ambulance dispatch failure. A new
software system was deployed that hadn't been load tested, and it had a memory
leak. It couldn't keep track of where the ambulances were, which led to them arriving
hours late. 46 people died who might have been ok if the ambulance had arrived on time.
And some near misses. Like, I haven't heard of any actual negative outcomes from
the OCR bug that went around in 2013, but you can see how it might end up with
numbers in prescriptions or structural engineering documents being catastrophically
And the news is full of software concerns in vehicles, self-driving or otherwise.
"It took a Newark fire and a Triangle fire to bring New York State's fire legislation to its
present inefficiency."
Inis Weed, New Outlook
volume 104, 1913
But none of those has been our Triangle fire. So far software has been able to kill
people one or a few at a time. We haven’t had the wide-scale disasters that have
shocked other industries into growing up.
Aviation regulations came from a bunch of people dying. Mining regulations came
from a bunch of people dying. Professional engineering organisations came from a
bunch of people dying. To quote my new favourite 1910s journalist, Inis Weed, "It took
a Titanic disaster to improve the safety of vessels. It took a Newark Fire and a
Triangle fire to bring New York State's fire legislation to its present inefficiency".
The use of software for life-critical systems grows every year. And every day we send
#hugops on Twitter to the people working on the latest massive software outage. At
some point these will overlap. Hope is not a strategy.
Are we ready for this kind of responsibility?
We, all of us here, are people who are responsible for software. The world will need a
lot of software over the next few decades. Some people in this room will run life-critical
systems. We are 1890s landlords looking at a whole lot of new opportunity. We
know there's money to be made from cutting all of the corners, but we have a choice.
I don't want us to wait for a disaster...
New Outlook, volume 104. https://books.google.com/books?id=URCzNkpDZp0C
Inis Weed or Inis Weed Jones made topics like medicine, sociology and science
exciting for regular people. She wrote extensively for Harper’s, Scribner’s and the
Reader’s Digest. She lived, at least for a while, at 337 West 22nd St. She wrote tons
about working conditions and humanised anonymous workers. She was an
investigator for the US Commission on Industrial Relations. She wrote articles like
The Reasons Why The Copper Miners Struck (about a strike), and Safer Childbirth
with Less Pain, and Acne: the Plague of Youth and Not By Bread Alone (about young
people returning to farming). She also published a book called "Peetie: the story of a
real cat", which is $72 on abebooks.com and I won't deny, I'm tempted. She reads like
a tremendously compassionate person who wrote about things people needed to care
about in an engaging way and made them care. (Please don't be a milkshake duck,
Let's choose not to build tenements.
...to decide not to build tenements.
Remember, some regulations didn't come from fires! Some came from a lot of people
deciding to care about the same thing at the same time.
We can decide now what good systems look like. We can create professional
standards and industry safety codes, and create and opt in to a professional
organisation to keep ourselves honest. And then, like the fire code, we can keep
revising and improving it until huge software outages are rare and shocking.
The entire industry should learn from every major outage. No secrets.
● Fire Escapes in Urban America: History and
Preservation, Elizabeth Mary Andre
● No exit: the rise and demise of the outside fire
escape, Sara E Wermiel
● How Fire Disaster Shaped the Evolution of the
New York City Building Code, Charles Shelhamer
● The Creative and Forgotten Fire Escape Designs of
the 1800s, Lauren Young
● New Outlook vol 104 (May-August 1913)
● RISKS Digest
● 1910 Newark Factory Fire, Mary Alden Hopkins
● New York City (NYC) Disasters, Baruch College
● Presentation template by SlidesCarnival
Find me at @whereistanya
Before I finish: if you're in New York, the NYFD and the Red Cross have a
shared campaign to give people free smoke alarms and free batteries. They'll
even come install it for you. If you don't have a smoke alarm, please search for
#GetAlarmedNYC and fill in their form. http://fw.to/Kzv1G4f
(Two SREs live in my apartment, so we already have two redundant meshes of
networked alarms from different manufacturers and also a few standalone alarms.)
This slide lists a few references that I found especially useful or interesting while
writing this talk. That first one contains a list of all the others, so hit up
http://noidea.dog/fires if you want a lot of links to read more about fires and fire
If you have comments on the talk, or questions or you're a building historian who is
willing to tell me what I got wrong, you can find me at @whereistanya on Twitter or