1. Blameless System Design
Douglas Land
Vast.com, Inc.
Hi, my name is Douglas Land. I'm the director of technical operations for a company in town called Vast.com. We do big
data and analytics, and we're starting a foray into several consumer-facing products. I'm here today to present a
concept called Blameless System Design.
Annotated: sample script in white boxes
2. I break systems… a LOT
Auth
Syslog
Chef
Ambassadors
Prod Frontends
I break things, A LOT. I've broken authentication across all our servers. I've broken syslog.. just by using it. I've created
havoc via chef runs across our whole infrastructure. I'm probably one of the worst offenders of breaking production on my
team.
3. Sometimes I ‘break’ systems on purpose...
Service discovery by chef
90% code in prod
No shared storage for cloudstack
Sometimes you just need to do things.
And sometimes I 'break' things on purpose. Sometimes you need to make trade-offs to meet your goals and objectives;
and you don't have the time or resources to adhere to standards. Sometimes you simply need to get something done as
soon as possible regardless of consequences.
4. Higher standards
And yet, I still hold others to a higher standard..
Servers still on public internet???
Created a flat VLAN when we did move to private IPs???
No centralized management of virtualization infrastructure???
The only 'shared storage' is via DRBD and ha.d???
And yet I somehow still hold others to a higher standard than I tend to follow myself. Every time I start a new job and
encounter a new environment, I look around at the choices that were made and the technical debt that's been generated, and I
think, "What the heck is going on here?" "What were these guys thinking?"
5. Technical debtor’s prison
We’re obsessed with technical debt
Qualifying it:
Application Debt
Infrastructure Debt
Architecture Debt
Quantifying it:
size of code base
code coverage
coupling and cohesion reports
cyclomatic complexity
Halstead complexity measures
I think we're a little obsessed with technical debt. We spend a lot of time trying to qualify it and quantify it. We try to break it
down, measure it, and figure out what the actual cost is and how to improve our software, systems and infrastructure to
compensate for it.
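To make one of the metrics on this slide concrete, here is a minimal sketch of an approximate cyclomatic complexity count using Python's standard `ast` module. The choice of which node types count as decision points is a simplifying assumption, and `cyclomatic_complexity` and `SAMPLE` are hypothetical names for illustration; dedicated tools count more cases than this does.

```python
import ast

# Node types treated as one decision point each (a simplifying assumption).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" has three operands and adds two extra paths.
            complexity += len(node.values) - 1
    return complexity


# Hypothetical sample code to measure.
SAMPLE = '''
def classify(n):
    if n < 0:
        return "negative"
    for d in range(2, n):
        if n % d == 0 and d > 1:
            return "composite"
    return "other"
'''

print(cyclomatic_complexity(SAMPLE))
```

A number like this says how many paths there are through a piece of code. It says nothing about why the code ended up that way, which is the part the rest of this talk is about.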
6. The myth of technical debt
Peter Norvig, “All code is liability”
Not actually technical debt:
● Maintenance
● Changes in understanding
● Operational inertia
● Poor code choices
● Dependency liabilities
In the process we end up including many things under that umbrella which don't have anything to do with technical debt at
all. Every platform or service is going to cease to be useful if we don't take the time to maintain it and understand how it's
evolved and changed.
7. So what is technical debt?
Technical debt is the choices we intentionally make to speed up the development
or implementation of systems, and which we acknowledge will need to be
changed later.
Technical debt is the result of an Efficiency-Thoroughness Trade-Off at an
individual level.
Technical debt is the output of a project constraint model at an organizational
level.
So what is technical debt? I'd qualify it as something intentional, as something we acknowledge we'll need to change
later. At an individual level it's the result of an Efficiency-Thoroughness Trade-Off. At a business level it's the result of
constraints like cost and speed.
8. The blame game
Shouldn't we stop blaming people for making the trade-offs they're forced to
make?
So if we acknowledge that we all need to make trade-offs, either in the name of personal efficiency, cost savings, or time, I
think we can also acknowledge that none of us want to make those trade-offs; they're artifacts of the environment we work
in. We shouldn't be blamed for them.
9. Being Blameless
● If we remove fear we will have a more
honest conversation about trade-offs
● If we're honest about those trade-offs,
crises might be averted altogether
● If we understand our history, we won't be
destined to repeat it
Being 'blameless' has, in fact, proven to be beneficial to business. If you're not afraid of retribution, you're more likely to be
honest. The more honest you are, the more everyone can learn about all kinds of situations, and the more we learn about
things, the more opportunity we have to improve.
10. What is blameless system design?
Assuming goodwill
Blameless post-mortems
Empathy
Experimentation
Honesty
Communication
So what is blameless system design? It's basically trying to look at things through others' eyes, and to give everyone as
much context as possible about any decisions being made. Since we in the tech community like acronyms, I also tried to
make a handy one. So Blameless System Design is A-BEECH.
11. Assume Goodwill
Your co-worker probably doesn’t come into work every day with
the intent of harming you or the organization.
*Most* people aren’t trying to cause issues... It's important to think about the fact that everyone is generally trying to
do the best job they can and to start decisions and discussions from that perspective. It's important to remember
that, if someone makes a mistake, it's from a place of misunderstanding, not malice.
12. Blameless Post-mortems
“We must strive to understand that accidents don’t
happen because people gamble and lose.
Accidents happen because the person believes that:
…what is about to happen is not possible,
…or what is about to happen has no connection to
what they are doing,
…or that the possibility of getting the intended
outcome is well worth whatever risk there is.”
- Erik Hollnagel
While blameless system design isn't error-focused, it's
important to have a framework in place when there are
issues. Blameless retrospectives remove fear from the
process and encourage people to improve the system
instead of seeking retribution, which is important for a
high-functioning team.
13. Empathy
● Reject ‘contempt culture’
● Focus on the positive
● Consider others’ perspectives
You might be sitting next to the person who had to make the tough call you’re critiquing. Someday, that person might be
you. Rather than jumping to judgements, it's important to try to understand how someone might have arrived at their
narrative and how that might have shaped the decisions they made.
14. Experimentation
The Engineering Design Process
Define the Problem
Do Background Research
Specify Requirements
Brainstorm Solutions
Choose the Best Solution
Do Development Work
Build a Prototype
Test and Redesign
No system lives in isolation, and complex system interactions can cause some very unexpected behavior. Without
experiments, we have no way to qualify our assumptions about those interactions. This is why it's so important to measure
and record everything. Design your experiments; don't be a victim of them.
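As one way to picture "measure and record everything", here is a minimal sketch that compares a metric before and after a change and records whether the hypothesis held. All names (`ExperimentResult`, `run_comparison`) and the sample numbers are hypothetical, and it assumes a metric where lower is better, such as latency.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class ExperimentResult:
    hypothesis: str
    baseline_mean: float
    trial_mean: float
    improvement_pct: float  # positive means the trial value is lower (better)
    supported: bool


def run_comparison(
    hypothesis: str,
    baseline: Sequence[float],
    trial: Sequence[float],
    expected_improvement_pct: float,
) -> ExperimentResult:
    """Compare mean measurements taken before and after a change."""
    b = mean(baseline)
    t = mean(trial)
    if b == 0:
        raise ValueError("baseline mean must be non-zero")
    improvement = (b - t) * 100.0 / b
    return ExperimentResult(hypothesis, b, t, improvement,
                            improvement >= expected_improvement_pct)


# Made-up measurements, for illustration only.
result = run_comparison(
    "hypothetical: the change lowers latency by at least 10%",
    baseline=[100.0, 110.0, 90.0],
    trial=[80.0, 85.0, 75.0],
    expected_improvement_pct=10.0,
)
print(result)
```

The point of returning a record rather than a pass/fail flag is that the baseline and the stated hypothesis get stored alongside the outcome, so a result that did not meet expectations is still worth keeping.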
15. Honesty
● Publish ALL your results
● Document ALL your decisions
● Be honest about trade-offs
● Track mitigations
Publish all your experiments and results whether they met your expectations or not. Document your decisions somewhere
so future reviewers will understand them. Be explicit in the docs about issues you came across and how you addressed
them. Be honest about trade-offs.
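A decision log does not need special tooling. The sketch below is one hypothetical shape for a record that keeps the decision, its trade-offs, and its mitigations together; the class names, fields, and the example text are assumptions for illustration, not a description of how any particular team does it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Mitigation:
    description: str
    done: bool = False


@dataclass
class DecisionRecord:
    title: str
    decision: str
    trade_offs: List[str] = field(default_factory=list)
    mitigations: List[Mitigation] = field(default_factory=list)

    def open_mitigations(self) -> List[str]:
        """Mitigations still outstanding, so they don't get buried."""
        return [m.description for m in self.mitigations if not m.done]

    def render(self) -> str:
        """Plain-text summary suitable for pasting into a doc or wiki."""
        lines = [f"Decision: {self.title}", f"What we chose: {self.decision}"]
        lines.append("Trade-offs:")
        lines.extend(f"  - {t}" for t in self.trade_offs)
        lines.append("Mitigations:")
        lines.extend(
            f"  [{'x' if m.done else ' '}] {m.description}"
            for m in self.mitigations
        )
        return "\n".join(lines)


# Illustrative content only; the details are placeholders.
record = DecisionRecord(
    title="Private cloud without shared storage",
    decision="Deploy with local disks only (placeholder wording)",
    trade_offs=["Placeholder: harder to move workloads between hosts"],
    mitigations=[
        Mitigation("Placeholder: revisit storage design", done=False),
        Mitigation("Placeholder: document recovery steps", done=True),
    ],
)
print(record.render())
```

Listing open mitigations separately is the part that maps to "track mitigations": anything not yet done stays visible in a backlog instead of disappearing into the original write-up.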
16. Communication
● Broadcast expectations
● Honor achievements
● Make docs easy to find
● Open discussions
● Well-defined feedback
channels
Broadcast cultural expectations throughout the organization, repeatedly if needed. Open up meetings and discussions to
anyone who wants to participate; they just might provide unexpected insight. Clearly define both positive and negative
feedback channels so everyone knows how to provide input.
17. Did someone say devops?
● Culture
● Measurement
● Sharing
● Feedback loops
If some of this sounds familiar,
it's because it is. Blameless
system design includes many of
the attributes of devops in
general. A huge part of devops is
culture and hopefully some of this
might be actionable for people
trying to address that inside their
organization.
18. The bad
It’s hard to change culture and get away from a retribution
culture and the RCA mentality
It’s hard to get over hindsight bias.
It’s a lot of work to encourage openness and honesty, and
define what that looks like.
It’s hard for everyone to get over their impostor syndrome and / or contempt
cultures.
It's hard to change an organization's culture. It's effectively asking an organization to accept risk: risk of the unknown. And
depending on the organization, that can be a little like steering the Titanic. You really need to co-opt your boss and have him
co-opt his boss; it's turtles all the way up.
19. The good
● Remove fear
● Encourage ‘risk’
● Create feedback
● Reduce redundant learning
● Improve working environment, trust
But if you can pull it off, you remove fear as an obstacle to innovation, encourage people to take risks (which could lead to
differentiation as a business), create better feedback loops, improve data flow, and create more trust at every level of your
organization. I think you'll find it well worth the effort.
20. Douglas Land - Director of operations, Vast.com, Inc.
doug@webuilddevops.com | @webuilddevops
Some References:
http://www.datical.com/blog/technical-debt-devops/
http://laughingmeme.org/2016/01/10/towards-an-understanding-of-technical-debt/
http://blog.aurynn.com/86/contempt-culture
http://erikhollnagel.com/ideas/etto-principle/index.html
http://indecorous.com/fallible_humans/
https://hbr.org/2003/05/it-doesnt-matter/ar/pr
https://codeascraft.com/2014/07/18/just-culture-resources/
http://sidneydekker.com/just-culture/
I'd love to say we're at the end of the
journey to blameless system design,
but like many things I suspect this is
not a destination, and we're still a
work in progress. But thanks to
everyone who has contributed to the
work I've cited, we're making progress
day by day. Thank you.
Editor's Notes
• ❑ name
• ❑ title
• ❑ company
• ❑ about talk
Intro: name, occupation
Broke ALL OF auth
Broke syslog by.. using it
Broken all chef runs innumerable times
Broke FE by turning back up some old nodes not properly decommissioned
Broke our ambassador setup with some bad template logic
https://i.ytimg.com/vi/GTkcjjt2TBY/maxresdefault.jpg
I ship 90% code which sometimes makes it into production
I hide a LOT of things behind config management that shouldn't be handled at that level
I decided to deploy our private cloud with no shared storage
I decided to attack service discovery with chef vs making devs register applications
Sometimes we make decisions we know are mistakes in the name of moving forward.
http://paragondsi.com/wp-content/uploads/2015/06/office-space.jpg
What were people thinking??? Why are they leaving all this technical debt behind??
we're all constantly talking about and trying to quantify technical debt
Application Debt – Debt that resides in the software package
Infrastructure Debt – Debt that resides in the operating environments
Architecture Debt – Debt that resides in the design of the entire system
measuring technical debt
size of code base
code coverage
coupling and cohesion reports
cyclomatic complexity
Halstead complexity measures
https://upload.wikimedia.org/wikipedia/commons/thumb/c/c7/William_Hogarth_018.jpg/1239px-William_Hogarth_018.jpg
Rather, there is ONLY technical debt -
Kellan Elliott-McCrea Former CTO of Etsy - towards-an-understanding-of-technical-debt: "Technical debt is the choices we made in our code, intentionally, to speed up development today, knowing we’d have to change them later. "
things ascribed to technical debt are just facets of creating software: maintenance, change in understanding,
instead of treating it like an exception, we should just embrace it
http://cattype.deviantart.com/art/Tsunami-Relief-Fund-216541678
No one *wants* not to do their job well.
We’ve all had to make trade offs to balance priorities
Fast, cheap, good - the only people who can beat the good, fast, cheap triangle can't even be running a business
As Erik Hollnagel stated, "The ETTO [ Efficiency-Thoroughness Trade-Off ] fallacy is that people are required to be both efficient and thorough at the same time – or rather to be thorough when with hindsight it was wrong to be efficient!"
The more complex a system the higher likelihood of failure
Shouldn't we stop blaming people for making the tradeoffs they're forced to make?
https://www.flickr.com/photos/cafuego/12575046354
etsy has done a great job bringing 'just culture' to postmortems, but that can be expanded beyond the scope of issues
There are trade-offs in EVERY system design
Restorative vs punitive model
If we remove fear we will have a more honest conversation about those tradeoffs
if we're honest about those tradeoffs crisis might be averted altogether
If we understand our history, we won't be destined to repeat it
https://upload.wikimedia.org/wikipedia/commons/8/8c/Tumbeasts_servers.png
https://upload.wikimedia.org/wikipedia/commons/4/49/Smurf_Zombies_-_Flickr_-_SoulStealer.co.uk.jpg
blameless system design is a beech
Most people aren’t trying to bring about computergeddon.
Bring empathy to the table when you’re discussing someone’s design.
Has tooling improved? Did that shiny OSS project that will fix all of this ‘mess’ even exist in a production ready state when this was implemented?
What logic might have led to this design choice?
Put yourself in their shoes.
https://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/Goodwill_Industries_Logo.svg/341px-Goodwill_Industries_Logo.svg.png
while not error focused it's important to have a constructive framework in place when there are problems
ensures balanced accountability for both individuals and the organization
analyzes errors rather than judging people
removes fear from the process
encourages people to improve the system instead of seeking retribution
https://upload.wikimedia.org/wikipedia/commons/a/af/Aachen_Allegory.jpg
You might be sitting next to the person who had to make the tough call you’re critiquing. Someday, that person might be you
Reject 'contempt culture' and the trading of condescension for prestige
try to understand how someone might have arrived at their self-taught narrative and how that might have shaped decisions
focus on the good qualities of a design and see if those can be extended or applied other places
https://upload.wikimedia.org/wikipedia/commons/8/85/Mother's_love.jpg
No system lives in isolation
Without experiments, we have no way to qualify our assumptions about those interactions.
Measure Measure Measure and record!
We deal with complex system interactions that can cause some very unexpected behavior.
Record metrics at every step with every change to qualify your work
design your experiments, don’t be a victim of them.
https://upload.wikimedia.org/wikipedia/commons/e/e7/Atomic_Laboratory_Experiment_on_Atomic_Materials_-_GPN-2000-000663.jpg
Publish all your experimentation results whether they bore fruit or not
Document your decisions somewhere so future reviewers will understand them.
Save future reviewers / architects some time by being explicit about issues you came across and how you addressed them.
Be honest about trade-offs, this is not the place to be shy about the skeletons in the closet
track mitigation responses, at least in a backlog, so they don't get buried over time to later re-emerge from their graves
https://www.flickr.com/photos/rosengrant/3929869118
broadcast cultural expectations throughout the organization
reinforce our organization with respect or a sense of achievement
provide easy to find and access information about all systems
open up meetings and discussions to anyone who wants to participate, they just might provide unexpected insight
establish both positive and negative feedback channels
https://upload.wikimedia.org/wikipedia/commons/thumb/3/37/Communication_shannon-weaver2.svg/2000px-Communication_shannon-weaver2.svg.png
if some of this sounds familiar, it's because it is
blameless system design includes many of the skills of the devops movement
We've got the CMS in CAMS
Culture
Measurement
Sharing
creates feedback loops
http://www.bouwkennisblog.nl/wp-content/uploads/2014/04/luisteren.jpg
hard to change retribution culture and the RCA mentality
hard to get over hindsight bias
It's a lot of work!
championing efforts
encouraging openness
defining what is broadcast
everyone will need to get over their impostor syndrome and / or contempt cultures
the organization must be willing to accept risk
risk from new system design and complexity
risk from choosing to leave old systems in place
risk from updating old systems
once risk has caused failure, organizations must be willing to try restorative measures (and not break trust)
organizations must be willing to be honest and frank about both the good and the bad aspects of their systems
https://pixabay.com/static/uploads/photo/2013/07/13/10/32/bad-157437_960_720.png
Why do this?
removes fear as an obstacle to innovation
encourages people to take risks, which could lead to differentiation as a business
creates good feedback loops to increase iterations
creates good data to prevent 'retracing each other's steps'
improves the working environment and relationships
https://pixabay.com/static/uploads/photo/2013/07/13/10/32/good-157436_960_720.png