Despite our best efforts with Agile best practices -- CI, unit testing, design reviews, code reviews -- every few years we end up rewriting our software.
This talk is about my personal experiences with project failure, improvement failure, patterns across the industry, and the key lessons that were critical to success. I had a great team. We were disciplined with best practices and spent tons of time working on improvements. Yet still, I watched my team slam into a brick wall. We brought down production three times in a row, then couldn't ship again for another year.
We thought our problems were caused by technical debt building up in the code base, but we were wrong. We failed to improve, because we didn't solve the right problems. Eventually, we turned our project around, but with a lot of tough lessons along the way.
When I started consulting, I saw the same patterns across the industry:
Our problems are invisible. They're hard to measure and hard to explain. We rely primarily on gut feel to make improvement decisions, and it's easy to make improvements that don't make much difference.
The relentless business pressure never lets up, and developers don't have control. Without visibility, management makes ill-informed decisions, train-wrecks our projects, and drives the development team to exhaustion.
I turned the lessons I learned into the Idea Flow Learning Framework, a data-driven feedback loop for guiding improvements. By quantifying the impact of disruptions, test maintenance, confusing code, and collaboration problems, we can find our biggest problems, understand the causes, and align our development priorities with leadership.
13. The further down the path,
the more tempting it is...
With invisible problems and business pressure that doesn’t let up,
we just keep repeating the same cycle.
RESET
15. RESET
“A description of the goal is not a strategy.”
-- Richard P. Rumelt
What’s wrong with our current strategy?
16. Our “Strategy” for Success
High Quality Code
Low Technical Debt
Easy to Maintain
Good Code Coverage
17. RESET
“A good strategy is a specific and coherent response to—
and approach for overcoming—the obstacles to progress.”
-- Richard P. Rumelt
The problem is we don’t have a strategy...
18. RESET
So what are the biggest obstacles that
cause the software rewrite cycle?
What specific approach are we going to take
to overcome the obstacles?
26. The Amount of PAIN was Driven By...
Likeliness of Unexpected Behavior x Cost to Troubleshoot and Repair = PAIN
(chart quadrants: High Frequency / Low Impact, Low Frequency / Low Impact, Low Frequency / High Impact)
27. What Causes Unexpected
Behavior (likeliness)?
What Makes Troubleshooting
Time-Consuming (impact)?
What causes PAIN?
Familiarity Mistakes
Stale Memory Mistakes
Semantic Mistakes
Bad Input Assumption
Tedious Change Mistakes
Copy-Edit Mistakes
Transposition Mistakes
Failed Refactor Mistakes
False Alarm
Non-Deterministic Behavior
Ambiguous Clues
Lots of Code Changes
Noisy Output
Cryptic Output
Long Execution Time
Environment Cleanup
Test Data Generation
Using Debugger
29. Once we understood causes, most problems were avoidable.
31. Urgency Leads to High-Risk Decisions
Iterative Validation with Unit Tests (timeline 0:00 - 7:01)
Skipping Tests and Validating at the End (timeline 0:00 - 14:23)
If I make no mistakes I save ~2 hours.
If I make several mistakes I lose ~8 hours.
An Avoidable Problem
38. How much work does it take to complete a software task?
Trade-off Decisions:
Writing Unit Tests (Direct Cost) or Troubleshooting Mistakes + Side-Effects from Ignoring the Risk (Indirect Costs)
We make high-risk decisions because the indirect costs are hard to quantify.
Risk = Likeliness of Event x Potential Impact
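The skip-the-tests trade-off can be made concrete with a back-of-the-envelope expected-value calculation, using the talk's rough numbers (~2 hours saved when nothing goes wrong, ~8 hours lost when it does); the mistake probabilities below are illustrative assumptions, not from the talk:

```python
# Expected net hours from skipping unit tests, using the talk's rough
# numbers: ~2 hours saved if no mistakes, ~8 hours lost if mistakes occur.
# The mistake probabilities tried below are illustrative assumptions.
def expected_gain(p_mistake, hours_saved=2.0, hours_lost=8.0):
    return (1 - p_mistake) * hours_saved - p_mistake * hours_lost

for p in (0.1, 0.2, 0.5):
    print(f"P(mistake)={p:.0%}: expected gain {expected_gain(p):+.1f} hours")
```

With these numbers the shortcut already breaks even at a 20% chance of mistakes and loses time beyond that, which is why decisions that feel fast in the moment can be slow on average.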
48. Case Study 1: Healthy project, about 10 months old
(Idea Flow Map legend: Troubleshooting, Progress, Learning, Rework)
10-20% friction
Effects of Escalating Risk
49. Case Study 2: Thrashing project, about 18 months old
(Idea Flow Map legend: Troubleshooting, Progress, Learning, Rework; timelines 0:00 - 28:15 and 0:00 - 12:23)
40-60% friction
Effects of Escalating Risk
50. Case Study 3: Post-meltdown project, about 12 years old
(Idea Flow Map legend: Troubleshooting, Progress, Learning, Rework; timelines 0:00 - 7:07 and 0:00 - 19:52)
60-90% friction
Effects of Escalating Risk
51. Task durations: Case Study 1: 1 day - 2 days; Case Study 2: 1 day - 3 days; Case Study 3: 1 day - 3 days
We can’t see these effects by measuring velocity or task lead-time.
Effects of Escalating Risk
54. Organizational Deadlock
Developers are stuck because they don’t have control.
Management is stuck because they don’t have visibility.
What are we supposed to do?
55. 1. Don’t ask for Permission
2. State your Goal
"I want to make the business case to management for fixing things around
here. No more chaos and working on weekends, this needs to stop. But I
need data to make the case so I need everyone's help."
3. State the Plan
"Here's what I'm thinking. I want to run an experiment to record data for one
month on all the time we spend troubleshooting. We can look at the data
together and identify our biggest problems, then I’ll write it up and present
the case to management to get things fixed.”
4. Enlist the Team
“Will you guys help me make this happen?”
Make the Decision to Lead
59. Share the PAIN!
Over the last month,
we’ve spent ~50% of our time troubleshooting
Troubleshooting JavaScript Errors
60. Share the PAIN!
With 25 developers on the team, rushing to save a couple hours
caused over 1000 hours of developer downtime
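The slide's downtime figure can be sanity-checked with simple arithmetic. The 25 developers and ~50% troubleshooting share come from the slides; the 160 working hours per developer-month is an assumption:

```python
# Rough sanity check of the downtime claim. Inputs: 25 developers (from
# the slide), ~50% of time lost to troubleshooting (previous slide), and
# an assumed ~160 working hours per developer per month.
developers = 25
hours_per_month = 160
troubleshooting_share = 0.5

downtime = developers * hours_per_month * troubleshooting_share
print(f"~{downtime:.0f} developer-hours of downtime per month")
```

Even with these conservative inputs the total comfortably clears the "over 1000 hours" stated on the slide.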
64. Take Responsibility
Dedicated resources (1 or 2 developers)
Control of all decisions for improvement work
Make a commitment to show results in 3 months
Ask for a 3-Month Trial
66. Idea Flow Learning Framework
“Idea Flow” is a metaphor for the
human interaction in software development
Idea Flow Map
67. Idea Flow Learning Framework
Troubleshooting Learning Rework
Strategy for improving software predictability
by optimizing developer experience.
68. Need a Feedback Loop...
Input: Task + Constraints
Target: Optimal Idea Flow
Output: Actual Friction
69. Mentorship Process
Input: Task + Constraints
Target: Optimal Idea Flow
Output: Actual Friction
1. Visibility
2. Clarity
3. Awareness
70. Idea Flow Learning Framework
1. Make the Problems Visible
2. Understand the Problems
3. Improve Decision Habits
Strategy for improving software predictability
by optimizing developer experience.
78. 3. Reflect on Decisions
Learn to Read the Visual Indicators in Idea Flow Maps
Left Atrium / Left Ventricle / Right Ventricle / Right Atrium
What’s causing this pattern?
Similar to how an EKG helps doctors diagnose heart problems...
79. ...Idea Flow Maps help developers diagnose software problems.
Problem-Solving
Machine
Learn to Read the Visual Indicators in Idea Flow Maps
3. Reflect on Decisions
85. Depending on where the disruptions are in the process,
we see a different effect in Idea Flow.
86. "How did you evaluate the possible options and choose a strategy?"
"What was wrong with the different strategies you tried?"
"What was the discovery that made you choose a different direction?"
Problems with Evaluating Alternatives
87. "Were you working with something that you were unfamiliar with?"
"Did you run into code that was noisy, ambiguous, or misleading?"
"What do you think made it difficult to learn?"
Problems with Modeling
88. "Did your task involve changes to complex code or business rules?"
"Were there a lot of details that you had to keep in your head?"
"What was causing the complexity in the validation cycles?"
Problems with Refining
89. "What experiments did you run to troubleshoot the problem?"
"How many times did you run the experiment?"
"How long did it take to get through each experiment cycle?"
Problems with the Validation Cycle
90. "Was there something in the code that made these changes especially mistake-prone?"
"How familiar were you with the language and tools you were working with?"
"Were you tired or distracted when you did the work?"
Problems with Execution
91. 1. Make the Problems Visible
The goal of visibility is to ask the right questions
so we can identify problems with our strategy.
93. Idea Flow Learning Framework
1. Make the Problems Visible
2. Understand the Problems
3. Improve Decision Habits
Strategy for improving software predictability
by optimizing developer experience.
114. 2. Break Down the Patterns
Create a Vocabulary of Patterns
Pain Types (Symptoms)
Problem Types (Causes)
Strategy Types (Fixes)
115. 2. Break Down the Patterns
What Causes Experiment Pain? (chart axis: 50 / 100 hours)
Create a Vocabulary of Patterns
116. Why did it take an hour
to find a typo?
Slow Feedback Loop - Ran experiments by firing up the application and sending an email
Numerous Feedback Loops - Problem was tricky to track down; sent 23 emails
Tags: #ExperimentPain
What Causes Experiment Pain?
118. What Strategies Do We Use?
Iterative Unit Testing
Vertical Slices
Talk to the Familiar
Incremental Integration Test
Intelligent Logging
Isolate Hard-to-Test Code
Tight Control
State Controller
Front Load the Risk
121. How do we know if we’re making things better?
122. What does “better” really mean?
“Better” following best practices
“Better” solving the problems
Best Practices
(solution-focused)
Decision Principles
(problem-focused)
123. What’s a “Decision Principle”?
Answers two questions:
How do we evaluate our situation?
What are we trying to optimize for?
124. Decision Principles
“You know that thing that happens when you make too many changes at once
and it's really hard to troubleshoot problems? We want to avoid that.”
“If I decide to skip the unit tests, how will that affect my haystack size?”
Haystack Effect
The number of unvalidated changes
has a huge impact on the
difficulty of tracking down a problem.
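One hedged way to see why haystack size matters: even with a perfectly disciplined bisection, isolating one bad change among N unvalidated changes takes a number of experiment cycles that grows with N. The per-cycle cost and the bisection model below are illustrative assumptions, not from the talk:

```python
import math

# Illustrative model: one of N unvalidated changes caused the failure.
# Even a disciplined binary search needs ~ceil(log2(N)) experiment cycles
# to isolate it, and every cycle has a fixed setup-and-run cost.
def troubleshooting_minutes(unvalidated_changes, minutes_per_cycle=10):
    if unvalidated_changes <= 1:
        return minutes_per_cycle  # one run confirms the only suspect
    cycles = math.ceil(math.log2(unvalidated_changes))
    return cycles * minutes_per_cycle

for n in (1, 4, 32, 256):
    print(f"{n:>3} unvalidated changes -> ~{troubleshooting_minutes(n)} minutes")
```

Validating after each change keeps the haystack at one suspect, which is the whole point of iterative unit testing.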
125. “You know when you run an experiment and it’s really hard to figure out what’s
going on? We want to avoid that.”
“How can I design my code sandwich to make this easier to troubleshoot?”
Setup the Inputs
Interpret the Outputs
Code Sandwich
By reducing the
thickness of my code sandwich
I can reduce diagnostic difficulty.
Decision Principles
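As a hedged illustration of the code-sandwich idea (the `discount_price` function below is hypothetical, not from the talk): the "bread" is setting up the inputs and interpreting the outputs, the "filling" is the code being validated, and a thin filling per validation keeps diagnosis cheap:

```python
def discount_price(price, percent):
    # The "filling": one small slice of logic validated in isolation.
    return round(price * (1 - percent / 100), 2)

def test_discount_price():
    price, percent = 80.0, 25          # Setup the Inputs (top slice of bread)
    result = discount_price(price, percent)
    assert result == 60.0              # Interpret the Outputs (bottom slice)

test_discount_price()
print("thin sandwich: if this fails, only one slice of logic is suspect")
```

When the sandwich is thick, whole subsystems sit between the inputs you set up and the outputs you interpret, so a failed check points at everything at once.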
126. We can learn what works by running
“Strategy Experiments”
127. 3. Run Strategy Experiments
1. Decide on a Strategy
2. Predict the effects in Idea Flow
3. Explain the results
(cycle: Strategy -> Prediction -> Explanation)
128. 3. Run Strategy Experiments
“I’m going to use Iterative Unit Testing to mitigate the risk of a big haystack."
Prediction: My strategy of #IterativeUnitTesting seemed successful,
troubleshooting should take <15m
129. 3. Run Strategy Experiments
Prediction: My strategy of #IterativeUnitTesting seemed successful,
troubleshooting should take <15m
Explain Why? Situation Pattern:
#HeavyIntegrationLogic
“I’m going to use Iterative Unit Testing to mitigate the risk of a big haystack."
131. 2. Understand the Problems
The goal of understanding the problems
is to identify the “critical few”
that will make the biggest difference.
Pareto’s Law - (80/20 Rule)
132. Idea Flow Learning Framework
1. Make the Problems Visible
2. Understand the Problems
3. Improve Decision Habits
Strategy for improving software predictability
by optimizing developer experience.
133. Clarity without Awareness...
“Well that was a stupid thing to do. Clearly, I shouldn’t have created a big
haystack because troubleshooting would obviously be difficult.”
We have to
be aware in the moment
we make the decision.
135. Stop and Think
Practice asking the right question at the right time.
Situation: Developer struggling with an experiment
Ask Questions (poke the developer):
“What does your code sandwich look like for this experiment?”
“What are you planning to observe in order to understand the behavior?”
“What parts of the system are you trying to manipulate in order to see variations in behavior?”
137. 3. Improve Decision Habits
The goal of improving decision habits is to
make the improvements permanent.
138. Idea Flow Learning Framework
(Software Control = Predictability)
Input: Task + Constraints
Target: Optimal Idea Flow
Output: Actual Friction
1. Visibility
2. Clarity
3. Awareness
139. So how do we break the software rewrite cycle?
(cycle: Unmaintainable Software -> Start Over)
We Learn.
140. Thank you!
If you liked the talk, please tweet about #ideaflow!
Free e-book if you sign up by July 15th!
Twitter: @janellekz
Email: janelle@newiron.com
Editor's Notes
I’ve been a developer for about 15 years, then got into consulting, and now I’m CTO@New Iron, a software-niche recruiting company that specializes in technical assessment and software mentorship.
Despite our best efforts with agile best practices, every few years we end up sitting around a conference table, talking about what went wrong. Wondering, should we scrap the entire system and start over? Or just deal with the problems and keep on going? Who’s been there!? It’s a pattern across our industry. Why don’t we get the results we’re looking for? We’ve been talking about the same problems for decades. How many of you guys are working on a rewrite effort right now?
How many of you are working on a project that’s a rewrite of an earlier project? Look around the room. Does it surprise you that so many hands are up? Why not?
First of all, I’m not saying you shouldn’t rewrite your software. The cause of your problems may not be what you think. And the good thing about turning a project around instead of starting over is that you already know what your problems are.
So in this talk, I’m going to tell you my story of project failure, what I learned from turning the project around, and the major lessons I’ve been working on codifying into a learning framework -- which I could title “how to figure out what your problems are.”
If we don’t make time to deal with emerging risks, we will never get out of this cycle.
We were building this factory automation system that was responsible for detecting manufacturing problems then shutting down the tool responsible.
I’d been on the project about 6 months, we were working through the final testing of a major release, tied a bow on it, shipped to production.
Later that night we were on this conference call with IT. And I hear this guy just screaming in the background. Apparently, we had shut down every tool in the factory.
So we rolled back the release and tried to figure out what happened. There was a configuration change that didn’t quite make it to production.
We all felt terrible, but there wasn’t much we could do at this point. So we fixed the problem, and shipped to production... again.
Back on the conference call with IT. And guess what... the same thing happened. What were we supposed to say… oops?
So once again, we rolled back the release.
We couldn’t reproduce the problem. We spent months trying to figure it out, and we were just completely stumped. We tried everything we could think of.
Meanwhile, our development team was pretty much idle so management just told them to go ahead with the next release.
Everyone was working like nothing was wrong, but we couldn’t ship anything. We had another whole release in the queue before we finally figured it out. Guess what was wrong?
We were scared to death to try again, but we didn’t really have a choice. So we crossed our fingers, and shipped to production again.
Back on the conference call with IT. We were all watching these real time activity charts and holding our breath. Finally everything seemed to be ok.
I was so relieved that things would finally be back to normal again. And then about 3am, my phone rang.
It was my team lead calling... he asked me about some code that I’d written, and I knew exactly what happened.
I’d made some “improvements”, refactored a few things, rearranged the design. Made the code better. I did TDD! Apparently I introduced a memory leak too. My changes ground the system to a screeching halt.
Only this time the rollback failed. Not only did I bring the system down, we couldn’t get it out of production.
We didn’t end up shipping again for another year. Our customers were scared to install our software. How could you blame them.
Fully ramped semiconductor factory. 50k wafer starts a day. Completely offline. My fault.
I felt so horrible. I was in my boss’s office, just sobbing.
What are the questions we ought to be asking ourselves?
Because we were all so focused on technical debt.
Here’s what I’ve learned in turning a project around. I’ve been working on codifying what I’ve learned into a learning framework. I’ve got a limited number of data points right now, so I’m looking for guinea pigs that would be willing to try it. It’s not easy, but it works.
Since best practices are solution-focused, we always start with the hammer and look for the nail.
Test automation is our favorite hammer.
Instead we need to be characterizing all the different types of nails,
We’d always get a different set of bugs. What would you do?
I thought the main obstacle was all the technical debt building up in the code base that was causing us to make mistakes.
and if we made changes in the code that had more technical debt, we’d be more likely to make mistakes.
So I got this idea to build a tool that could detect high-risk changes and tell us where we needed to do more testing -- but what I found wasn’t what I expected at all.
Our bugs were mostly in the code written by the senior engineers on the team where the design actually got the most scrutiny. It’s not like we didn’t have any awful crufty code -- but that’s not where the bugs were.
The correlation I did find in the data was this...
[read]
And while that made some sense, I couldn’t help but think, there had to be more to the story...
When I had to work with complex code, it was really painful.
[read]
So I started keeping track of all my painful interaction with the code and visualizing it on a timeline like this.
The pain started [] when I ran into some unexpected behavior and ended [] when I had the problem resolved.
So that was 5 hours and 18 minutes of troubleshooting, I think everyone would agree that’s pretty painful.
The amount of pain was driven by two factors...
So If I wanted to know what was causing the pain I needed to understand the things that caused these 2 factors.
A lot of the problems had more to do with human factors than anything going on with the code.
Stale Memory mistakes, Ambiguous Clues.
But once I understood what was causing the pain, [read -- most of the problems were easy to avoid]
For example...
A typical improvement effort usually starts with brainstorming a list
[slow] We think about the things that bugged us recently, how we’re not following best practices, or the code that just makes us feel shameful.
[] -- Then all that goes into our technical debt backlog, and we chip away at improvements for months.
But just because a problem comes to mind, doesn’t mean it’s an important problem to solve
When we’re brainstorming, [] we can easily miss our biggest problems then [our improvements don’t make...].
[] Don’t do this.
Find out what the substitution thing is called.
When it comes to solving these really complex problems, our intuition is just wrong. It leads us astray.
The pain isn’t something inside the code, pain occurs during the process of interacting with the code.
I realized that pain occurs during the process of extending the software. In other words, the problem is here [] , not here. []
We’d been looking inside the code to find our problems, but that’s not where the problems were!
This process needed a name, so I called it Idea Flow. All that really matters is how it affects our experience and our ability to deliver.
Avoiding Pain
Software projects don’t run on a little island where we can make perfect decisions -- we have to operate within the context of a business system.
A lot of our problems have nothing to do with the code -- they’re the effects of the organizational structure and everyone operating within the bounds of their role.
We incentivize individuals with goals that undermine the system.
We put the whole business at risk because we don’t understand the consequences of our decisions.
When we fall into urgency mode, we start compromising safety for speed.
We make decisions that don’t seem like a big deal at the time, but they create a hazardous work environment.
Instead of taking a little more time to put our toys away, we end up falling down the stairs and in the hospital.
Troubleshooting Risk we’ve already talked about, it’s driven by the likelihood...
Learning Risk is driven by the likelihood...
Things like... lots of 3rd party libraries, complex frameworks, a really large code base, or high turn-over rate --
all these things can cause extra learning work.
Rework Risk is driven by the likelihood...
Things like... bad assumptions about the architecture or design or bad assumptions about customer requirements.
The longer we delay before making corrections, the greater the rework.
This is from a project about 10 months old where we actively focused on reducing troubleshooting time.
With our everyday problem-solving effort, we still spent about 10-20% of our time on friction.
This second example is from a rewrite effort about 18 months old, under a lot of pressure to hit feature parity.
[slow] About 40-60% of their time was spent troubleshooting problems
and nothing was being done about it.
This 3rd example is from a huge project about ~2.5 million lines of code where all the original developers have left.
On a typical task, a developer would spend 90% of their time figuring out what to do, and 10% of their time changing the code.
Now I want to point out something. For all three projects, these tasks all took one to three days.
Generally speaking, as the problems build, we can still break down the work into bite-sized chunks.
but what we work on during that time dramatically changes.
[read] even when the problems are severe.
So if you thought about how much time you spend doing troubleshooting, learning and rework.
What percentage of time do you think it would be? Which do you think is the biggest?
What do you think the biggest causes are of troubleshooting time?
I know this takes a fair amount of work. I’m not going to lie. But we can
The pain isn’t something inside the code, pain occurs during the process of interacting with the code. The problems I focused on fundamentally changed.
If I write a little code then validate with unit tests as I go, even if I make a lot of mistakes, troubleshooting is usually fairly quick.
But If I’m in a hurry I might decide “eh, I’ll skip the tests” [] -- and then I spend all day troubleshooting mistakes.
Raise your hand if you’ve done this before?
When we’re under pressure to work as fast as we can, we make a lot more high-risk decisions.
If I make no mistakes [], skipping the unit tests does save time -- I can probably save a couple hours.
But if I make lots of mistakes [], I end up in troubleshooting hell and can easily lose the whole day.
Under constant urgency, we don’t stop and think and these high-risk decisions become a habit.
[slow] Our software is a reflection of our decision-making habits.
Our decisions are the cause of our pain.
When we get in the habit of making high-risk decisions, we get unmaintainable software as a result.
If we rewrite our software [], but don’t fundamentally change the way we make decisions, [] our problems just keep coming back.
[continue]
We’ve been blaming the code for causing the pain, but [read].
From the outside it looks like we’re trying to drive a car without a steering wheel.
We line up the car’s trajectory based on our ideals, then close our eyes and floor the gas pedal.
See the pain
But that wasn’t enough either… despite those things, he was still struggling.
We optimize for execution time, even when the time spent on human cycles can completely dwarf the execution time. Why do you think that is?
Used thinking checklists to codify a decision-making process… let me show you what I mean.
Focus on one decision principle until you have it down.
It’s not that best practices are bad, or wrong, they’re just backwards.
If our feedback loop is broken, we don’t respond.
Human system design is a lot like software design. Think of the roles and responsibilities like objects that interact with various parts of the environment, collect data, perform tasks, and communicate. This is the kind of stuff you can see.
Draw a picture.
Risk management in software development is an extremely complex problem. Just like it takes time to figure out the right product to build, it takes time to figure out the right improvements. We have to gather requirements.
Too uncomfortable. Team doesn’t want to leave their comfort zone.
Feeling exposed by the data.
Feeling incompetent. (I don’t want to know)
Shared focus of the team.
Too much urgency to deliver features.
So... [read]
We’re actually in a state of Organizational Deadlock -- Everyone is waiting for a resource they don’t have.
[] -- [read]
Here’s how you get time to work on improvements.
First, make the decision to lead.
Step 1. [read] Leadership is not a title bestowed upon you, it’s a choice to take responsibility. Nike’s got some good advice -- Just do it.
[read]
[read]
[read]
Next, you’ll need to make the case to management that change [read]
The key to success is focusing on the risks not estimating how much longer things will take. If it’s just more work, it sounds like we can throw more money at it, but working harder won’t solve the problem -- we have to work smarter.
We make decisions that save a few hours that lead to side effects that cost several hours. When we try to go faster, we do things that increase the likelihood of mistakes and the cost to recover when things go wrong. We’ve been in this pattern for the last 2 years, and now we’re here.
Share your Idea Flow Maps. There’s nothing like showing powerpoint slides with lots of red on them that gets managers to move.
To do the work, we had to setup data in the database, setup a new reporting template, then run the entire system at once to test the reports. When there’s a bug, it’s really hard to tell where the problem is and takes countless hours to track down the bugs.
This is a graph showing how often our development environment has been broken over the last month. Red dots are completely down. Blue dots are some features not working.
Whenever the environment is broken, it doesn't just impact one person. It usually impacts the entire team. The red dots are the times the environment has been completely down.
In the one I circled. []
I know we have a big deadline coming up, and we've been hurrying to get everything done, but in trying to go faster, we've dramatically increased risk.
Now, it's so expensive when things go wrong that trying to go faster is actually slowing us down.
If we rush to get the features completed, we're likely to arrive at the finish line with a lot of things broken.
On the other hand, if we focus on reducing the risk, we’ll end up in much better shape.
We’ve been collecting lots of data and have identified our biggest problems.
I think we can dramatically reduce risk with some focused effort.
I'd like to propose a 3-month trial [] with one person working full-time on these problems. The team will make decisions on the improvement work, and I’ll share our progress and lessons learned with you each month.
I know we can do this, but I need your help. Will you help me make this happen?