
Blameless system design - annotated

Blameless System Design ignite talk given at Devops Days Austin 2016

Published in: Technology

  1. 1. Blameless System Design Douglas Land Vast.com, Inc. Hi, my name is Douglas Land. I'm the director of technical operations for a company in town called Vast.com. We do big data and analytics, and we're starting a foray into several consumer-facing products. I'm here today to present a concept called Blameless System Design. Annotated: sample script in white boxes
  2. 2. I break systems… a LOT Auth Syslog Chef Ambassadors Prod Frontends I break things, A LOT. I've broken authentication across all our servers. I've broken syslog.. just by using it. I've created havoc via chef runs across our whole infrastructure. I'm probably one of the worst offenders of breaking production on my team.
  3. 3. Sometimes I ‘break’ systems on purpose... Service discovery by chef 90% code in prod No shared storage for cloudstack Sometimes you just need to do things. And sometimes I 'break' things on purpose. Sometimes you need to make trade-offs to meet your goals and objectives, and you don't have the time or resources to adhere to standards. Sometimes you simply need to get something done as soon as possible regardless of consequences.
  4. 4. Higher standards And yet, I still hold others to a higher standard.. Servers still on public internet??? Created a flat VLAN when we did move to private IPs??? No centralized management of virtualization infrastructure??? The only 'shared storage' is via DRBD and ha.d??? And yet I somehow still hold others to a higher standard than I tend to follow myself. Every time I start a new job and encounter a new environment, I look around at the choices that were made and the technical debt that's been generated, and I think, "What the heck is going on here?" "What are these guys thinking?"
  5. 5. Technical debtor’s prison We’re obsessed with technical debt Qualifying it: Application Debt, Infrastructure Debt, Architecture Debt. Quantifying it: size of code base, code coverage, coupling and cohesion reports, cyclomatic complexity, Halstead complexity measures. I think we're a little obsessed with technical debt. We spend a lot of time trying to qualify it and quantify it. We try to break it down, measure it, and figure out what the actual cost is and how to improve our software, systems and infrastructure to compensate for it.
  6. 6. The myth of technical debt Peter Norvig, “All code is liability” Not actually technical debt: ● Maintenance ● Changes in understanding ● Operational inertia ● Poor code choices ● Dependency liabilities In the process we end up including many things under that umbrella which don't have anything to do with technical debt at all. Every platform or service is going to cease to be useful if we don't take the time to maintain it and understand how it's evolved and changed.
  7. 7. So what is technical debt? Technical debt is the choices we intentionally make to speed up the development or implementation of systems, and which we acknowledge will need to be changed later. Technical debt is the result of an Efficiency-Thoroughness Trade-Off at an individual level. Technical debt is the output of a project constraint model at an organizational level. So what is technical debt? I'd qualify it as something intentional, as something we acknowledge we'll need to change later. At an individual level, it's the result of an Efficiency-Thoroughness Trade-Off. At a business level, it's the result of constraints like cost and speed.
  8. 8. The blame game Shouldn't we stop blaming people for making the trade-offs they're forced to make? So if we acknowledge that we all need to make trade-offs, either in the name of personal efficiency, cost savings, or time, I think we can also acknowledge that none of us want to make those trade-offs; they're artifacts of the environment we work in. We shouldn't be blamed for them.
  9. 9. Being Blameless ● If we remove fear, we will have a more honest conversation about trade-offs ● If we're honest about those trade-offs, crisis might be averted altogether ● If we understand our history, we won't be destined to repeat it Being 'blameless' has, in fact, proven to be beneficial to business. If you're not afraid of retribution, you're more likely to be honest. The more honest you are, the more everyone can learn about all kinds of situations, and the more we learn about things, the more opportunity we have to improve.
  10. 10. What is blameless system design? Assuming goodwill, Blameless post-mortems, Empathy, Experimentation, Honesty, Communication. So what is blameless system design? It's basically trying to look at things through others' eyes, and to give everyone as much context as possible about any decisions being made. Since we in the tech community like acronyms, I also tried to make a handy one. So Blameless System Design is A-BEECH.
  11. 11. Assume Goodwill Your co-worker probably doesn’t come into work every day with the intent of harming you or the organization. *Most* people aren’t trying to cause issues... It's important to think about the fact that everyone is generally trying to do the best job they can and to start decisions and discussions from that perspective. It's important to remember that, if someone makes a mistake, it's from a place of misunderstanding, not malice.
  12. 12. Blameless Post-mortems “We must strive to understand that accidents don’t happen because people gamble and lose. Accidents happen because the person believes that: …what is about to happen is not possible, …or what is about to happen has no connection to what they are doing, …or that the possibility of getting the intended outcome is well worth whatever risk there is.” - Erik Hollnagel While blameless system design isn't error-focused, it's important to have a framework in place when there are issues. Blameless retrospectives remove fear from the process and encourage people to improve the system instead of seeking retribution, which is important for a high-functioning team.
  13. 13. Empathy ● Reject ‘contempt culture’ ● Focus on the positive ● Consider others’ perspectives You might be sitting next to the person who had to make the tough call you’re critiquing. Someday, that person might be you. Rather than jumping to judgements, it's important to try to understand how someone might have arrived at their narrative and how that might have shaped the decisions they made.
  14. 14. Experimentation The Engineering Design Process: Define the Problem, Do Background Research, Specify Requirements, Brainstorm Solutions, Choose the Best Solution, Do Development Work, Build a Prototype, Test and Redesign. No system lives in isolation, and complex system interactions can cause some very unexpected behavior. Without experiments, we have no way to qualify our assumptions about those interactions. This is why it's so important to measure and record everything. Design your experiments, don’t be a victim of them.
  15. 15. Honesty ● Publish ALL your results ● Document ALL your decisions ● Be honest about trade-offs ● Track mitigations Publish all your experiments and results whether they met your expectations or not. Document your decisions somewhere so future reviewers will understand them. Be explicit in the docs about issues you came across and how you addressed them. Be honest about trade-offs.
  16. 16. Communication ● Broadcast expectations ● Honor achievements ● Make doc easy to find ● Open discussions ● Well-defined feedback channels Broadcast cultural expectations throughout the organization, repeatedly if needed. Open up meetings and discussions to anyone who wants to participate; they just might provide unexpected insight. Clearly define both positive and negative feedback channels so everyone knows how to provide input.
  17. 17. Did someone say devops? ● Culture ● Measurement ● Sharing ● Feedback loops If some of this sounds familiar, it's because it is. Blameless system design includes many of the attributes of devops in general. A huge part of devops is culture and hopefully some of this might be actionable for people trying to address that inside their organization.
  18. 18. The bad It’s hard to change culture and get away from a retribution culture and the RCA mentality. It’s hard to get over hindsight bias. It’s a lot of work to encourage openness and honesty, and define what that looks like. It’s hard to get over impostor syndrome and / or contempt cultures. It's hard to change an organization's culture. It's effectively asking an organization to accept risk; risk of the unknown. And depending on the organization, that can be a little like steering the Titanic. You really need to co-opt your boss and have him co-opt his boss; it's turtles all the way up.
  19. 19. The good ● Remove fear ● Encourage ‘risk’ ● Create feedback ● Reduce redundant learning ● Improve working environment, trust But if you can pull it off, it removes fear as an obstacle to innovation, encourages people to take risks (which could lead to differentiation as a business), creates better feedback loops, improves data flow, and creates more trust at every level of your organization. I think you'll find it well worth the effort.
  20. 20. Douglas Land - Director of operations, Vast.com, Inc. doug@webuilddevops.com | @webuilddevops Some References: http://www.datical.com/blog/technical-debt-devops/ http://laughingmeme.org/2016/01/10/towards-an-understanding-of-technical-debt/ http://blog.aurynn.com/86/contempt-culture http://erikhollnagel.com/ideas/etto-principle/index.html http://indecorous.com/fallible_humans/ https://hbr.org/2003/05/it-doesnt-matter/ar/pr https://codeascraft.com/2014/07/18/just-culture-resources/ http://sidneydekker.com/just-culture/ I'd love to say we're at the end of the journey to blameless system design, but like many things I suspect this is not a destination, and we're still a work in progress. But thanks to everyone who has contributed to the work I've cited, we're making progress day by day. Thank you.
