A lot of people try to measure developer productivity. But is it actually possible? I have spent the past four years figuring out which metrics work and how they can be implemented in teams. In this talk, I’ll share lessons from my journey and a few metrics that can help your team today.
The Elusive Quest to Measure Developer Productivity
“The single most important task of a manager
is to elicit peak performance from [their team].”
– Andy Grove, High Output Management
“Abi, just give me some number”
“If you can’t measure it, you can’t improve it.”
– Peter Drucker
Commits
Small, frequent commits support greater transparency, collaboration, and continuous delivery.
Use cases:
• Reward teams with high number/frequency of commits
• Improve teams with low number/frequency of commits
“Not everything that can be counted... counts.”
– Albert Einstein
Output cannot be measured accurately.
2. Lines of code
Poor measure of output:
- Differences in languages and formatting
- More lines of code is… worse
Good for understanding:
- The size of a software system
- How your codebase is changing
3. Pull request count
Poor measure of output:
- Doesn’t factor size or effort required of work
- Encourages unnecessarily small, chunked PRs
Good for understanding:
- Release cadence and continuous delivery
4. Velocity points
Poor measure of output:
- Sizing is done before work is completed, not after
- Undermines usefulness of estimation process
Good for understanding:
- Delivery forecasts based on past estimates
5. “Impact”
Poor measure of output:
- Same flaws as Lines of Code
- Too abstract to be actionable
The Flawed Five
1. Commits
2. Lines of code
3. Pull request count
4. Velocity points
5. “Impact”
“We are flying blind.”
Why We Measure
1. Prompt action
2. Goals and alignment
3. Advocacy
4. Higher purpose
Making Metrics Work
1. Measure process – not output
2. Measure against targets
3. Avoid individual metrics
Measure process – not output
• Code review turnaround time
• Pull request size
• Work in progress
• Time to open
• …
Several years ago, I was the CTO of a small software company. Things were going well, customers were happy, and I felt like I was doing a good job.
We’d recruited and hired great people, and the team seemed happy. We’d set up good workflows and pipelines, and our process felt efficient. We had regular 1:1s and retros, so everyone had constant feedback on how they could improve.
So things were going well.
But after a while… something started bothering me. I started feeling like we were plateauing. I couldn’t tell if we were actually getting better. I started questioning whether we were even good. Whether I was doing my job well.
I wasn’t the only one interested in knowing this. My boss was too. He asked the entire leadership team of the company to start reporting out metrics at our monthly meetings. He specifically asked me to provide some indicator of our team’s output and productivity… what we were getting done.
I protested. I told him that I had never seen productivity metrics work on an engineering team. And that the metrics that I knew of... simply didn’t work. He told me, “Abi, I don’t care what it is but just give me some number.”
So I thought about it. And the more I thought about it, the more the problem frustrated me. Why wasn’t there a metric I could use? How was it that I had no way of measuring how well my team was doing? Why hadn’t this problem been solved?
To try and find answers, I reached out to some mentors to ask for help. The first person I spoke to had been a CTO for almost 20 years and had managed hundreds of engineers. When I asked him what metrics could be used to look at productivity, he told me it was impossible... that we were flawed in even asking that question.
I reached out to another CTO… he told me he wanted to develop something like an NBA plus-minus for software teams... a way to measure the value contributed by each developer... but he hadn’t figured out how.
I spoke to several other CTOs, and they were all at a loss.
I couldn’t believe what I was hearing. These companies were spending tens of millions of dollars on engineers, yet no one had any way to track how well they were doing, let alone whether they were getting better or worse?
This seemed crazy to me. In almost every other profession I knew of, for example sports, marketing, or sales… there are established ways to use metrics to understand how well your team is doing. And these metrics are usually generated by software. But ironically, in software development, we haven’t figured this out. We don’t have ways of measuring how well we’re doing, or whether we’re getting better.
I thought to myself, there had to be a way. And I was determined to figure it out.
My name is Abi and I’ve spent the past several years working on figuring out what kinds of metrics can be used on software teams, and how to use metrics in ways that promote positive behaviors and changes on teams.
A couple years ago I started a company called Pull Panda, where we developed an analytics product called Pull Analytics. GitHub acquired Pull Panda earlier this year, and now I’m working with an unbelievable team at GitHub to make these kinds of capabilities available to every team.
Today, I’m going to tell you some stories from my journey, and share with you some of what I’ve learned along the way.
A few months after I spoke to my mentors, I left my job to work on this problem full time. I quickly discovered that metrics are a really hard problem. There are many things we can measure in software, but very few that we should.
It took me a while to realize this. I met with lots of companies… read up on all the literature and products I could find. And I also spoke a lot with my dad.
See my dad is a retired software developer. And if there’s one thing that developers and dads share in common, it’s that they have a lot of opinions… and like to voice them. I’ll tell you more about my dad in a minute.
One of the first metrics I looked at was commits. GitHub had a bunch of graphs showing commits, so it seemed like an easy thing to count… I knew it wasn’t going to be the perfect metric… but it seemed useful.
For starters, having no commits was definitely a red flag. If there are no commits… no code is getting shipped… and on a software team, that’s a problem right? See how useful this metric is already? It gives you an awesome way to see how much work is getting done!
And it gets even better! You see, teams that I spoke with agreed that making small, frequent commits leads to better designed code. And so if you increase your number of commits, that’ll in effect result in smaller, more frequent commits. This is awesome!
I thought I was onto something… so I showed it to some CTOs and they thought it sounded kind of interesting… but then I brought it up with my dad over a family dinner. He thought it was garbage! He said that no developer would ever want to be measured like this.
You see, my dad had been a developer for 30 years. And throughout his career, he’d seen many situations where managers would roll out terrible metrics and piss off everyone on the team. It was terrible!
He told me that tracking commits was a horrible idea because it said nothing about the actual value of the work delivered. And if someone wanted to they could easily game the metric by creating extra commits… which would be stupid.
So if you can’t tell already, my dad was super sensitive about metrics. So sensitive that he would reject, out of hand, almost any metric idea I pitched him. I’d ask him, “how about this metric?” ... or “how about this one?”, and he’d just shake his head and say that the problem I was working on was impossible. He made me feel like I was speaking of evil and blasphemous things.
So the tension between us was rising… and one afternoon, it blew up. I was reading the book Accelerate. It’s an awesome book which analyzes high performing engineering organizations, and suggests some metrics which can be used for benchmarking.
So I was reading the book... and came across something in it that I thought was a great idea. I was working from my parents’ house and my dad had come home from getting groceries… as he walked in the house, I told him “Hey dad... check this out…” and read a line out of the book to him. He just shook his head and dismissed it, so I countered back... I said “what are YOU talking about?”. Well, we got into a little verbal boxing match, and at some point he said “Abi, you know you’re not the Einstein of engineering metrics”. Given that I had been dedicating myself to researching engineering metrics, I was pretty offended… I started yelling at him, and things escalated to the point where I had to leave the house.
Neither of us is proud of that moment, but the more I’ve worked in this space, the more I think back to that conflict as a reminder of just how emotional a topic metrics are for both managers and developers.
My frustration as a manager… was so intense. But the fear and opposition from my father, who was a developer… was equally intense. And understandably so, right?
We all know that metrics can easily be misused… and that this can cause a lot of harm. Bad metrics can make developers miserable. They can undermine the good processes and culture we hope to achieve with metrics in the first place.
There are a lot of ways this happens. An obvious one would be if a manager were to reward or punish developers on their team based on a metric like number of commits. This is bad. You shouldn’t do this.
But there are less obvious ways in which metrics can affect your team. And in fact, even good metrics can be harmful if they’re presented in certain ways.
Let me go into an example…
This is one of the features we built in Pull Panda. It shows the average code review turnaround times of everyone on your team, from lowest to highest. Code review turnaround time is the time it takes for someone to respond to a review that’s been requested of them.
Code review turnaround is a great metric – I’ll talk more about it later – but the way it’s presented here is not.
Take a look at this page… and ask yourself, what kinds of negative behaviors do you think occurred when teams started viewing this?
…
So there were a few behaviors we observed:
• Review quality slipped because reviewers rushed to complete reviews
• Teams stopped requesting reviews on Fridays so the weekend wait wouldn’t affect their metrics
What’s fascinating here is that review turnaround time is actually a really good metric. But even a good metric can hurt your team when presented in certain ways.
So we’ve looked at a couple of ways that seemingly good metrics can backfire. This is really important. To be successful with metrics, you not only have to choose the right ones, but you have to present and use them in the right ways.
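To make the metric itself concrete: code review turnaround is just the elapsed time from when a review is requested to when the reviewer first responds. Here is a minimal sketch of computing it as a team-level aggregate rather than a per-person ranking (the event data below is hypothetical, not a real GitHub API shape):

```python
from datetime import datetime

# Hypothetical review events: (review_requested_at, first_response_at).
# The data shape is illustrative only.
reviews = [
    (datetime(2019, 11, 4, 9, 0), datetime(2019, 11, 4, 13, 30)),
    (datetime(2019, 11, 4, 10, 0), datetime(2019, 11, 5, 9, 15)),
    (datetime(2019, 11, 5, 14, 0), datetime(2019, 11, 5, 15, 45)),
]

# Turnaround = time from review request to first response, in hours.
turnarounds = [(responded - requested).total_seconds() / 3600
               for requested, responded in reviews]

# Report a single team-level average instead of ranking individuals,
# which avoids turning the metric into a leaderboard.
team_avg_hours = sum(turnarounds) / len(turnarounds)
print(f"Team average review turnaround: {team_avg_hours:.1f} hours")
```

The design choice here is the presentation: the same raw data can produce either a team aggregate or the lowest-to-highest individual ranking described above, and the two invite very different behaviors.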
I’ve highlighted how commits are a problematic metric. But it’s not the only one that spells trouble.
There are four other metrics that I see a lot of companies using as a way to measure productivity. But these metrics don’t work. I call them the “Flawed Five”.
In a moment I’ll walk through these metrics… but first…
I’ve said the word “productivity” throughout this talk, but we haven’t defined what productivity means in the context of software engineering.
I’ve spent a lot of time pondering this question. A few months ago I was in a meeting and I asked everyone in the room what their definition of productivity was.
One person raised their hand and suggested that productivity should be like GDP… something like the dollars of revenue a company brings in per engineer.
I thought to myself “that’s interesting… but that sounds kind of agricultural… we’re not building farms here…”
After this meeting I kept thinking about this and later that evening I googled the dictionary definition of “productivity”.
And the definition that came up was “the state or quality of producing something, especially crops”!
I thought this was hilarious… and of course it left me even more confused about what productivity is or how it should be measured in software development.
I’ve asked many developers and leaders about how they define productivity. And I’ve gotten many different responses. But the most common definition I get is “how much are we getting done?”. In other words, output. And this makes sense… after all, we do produce things in software development. And it’s therefore logical that we should try to measure “how much”.
But the problem in software development is that we can’t. The process of producing software isn’t like a factory assembly line where you can count how many widgets are produced and how much each one cost.
Software is more like art… like creating a painting… putting more paint on a canvas isn’t better, and similarly, having more lines of code or more pull requests doesn’t mean you’re creating something better.
It might take an artist a day of deep thought to figure out the perfect brush stroke, and similarly, software is a creative process where a small amount of code can take an immense amount of time and effort to figure out.
So in software, we can’t really measure output.
But that doesn’t stop people from trying. And that’s what the Flawed Five metrics share in common. They are all measures of output that are used to represent productivity – and they don’t work.
Let’s go through them.
Number of lines of code is a metric that’s been around for a long time. And it’s a really bad measure of productivity.
For starters, there are different languages and formatting conventions that greatly vary in the number of lines of code they generate. So 3 lines of code in one programming language might be exactly the same thing as 9 lines in another.
On top of that, any good developer knows that they can code the same stuff with huge variations in lines of code… and that refactoring code, which is good, results in less code.
So not only is lines of code inaccurate, but it incentivizes programming approaches that are counter to what leads to good software.
Unfortunately, lines of code is still a really common metric used in our industry.
I come across companies that use lines of code as a way of evaluating developers’ contributions to their team, even stack ranking and terminating developers based on it. I think we’d all agree that this isn’t good practice, but it’s surprisingly common.
We need to move away from this.
Another metric that I see being used to measure productivity is pull request count. Counting pull requests seems to be a fairly recent trend. I was at a meet-up last year and a manager told me that “pull request count is the new vanity metric”.
And I completely agree with him. Counting pull requests is a vanity metric. It’s not a good way of measuring how much work is getting done.
Tracking the number of pull requests created or merged doesn’t factor in the size, effort, or impact of that work. So it tells you almost nothing other than the number of pull requests created.
Like lines of code, this metric can encourage counterproductive behaviors. For example, this metric could encourage developers to unnecessarily split up their work into smaller pull requests, which would create more work and noise for the team.
I’ve seen this metric spreading like wildfire across our industry. A recent example I came across is GitLab’s engineering OKRs. These are published on their website.
In this OKR, their objective is to improve productivity by 60%. And they intend to achieve and measure this by increasing the number of merge requests created per engineer by 20%.
I don’t think this is a good practice. Counting pull requests might seem less offensive than counting lines of code, but both metrics suffer from similar flaws.
Okay, let’s move on.
Velocity points can be an unpleasant subject. I think a lot of developers see them as necessary evil.
I’m personally a big fan of velocity points and think that they can be an outstanding way of sizing and estimating work.
You run into problems, though, when you try to turn velocity into a measurement of productivity.
When you reward people or teams based on the number of points they complete, they are incentivized to inflate their estimates in order to increase that number… and when this happens, it makes the estimates, and the number of points you are completing, meaningless.
So as soon as you start using points to measure productivity, points become useless for their designed purpose.
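Used for their designed purpose, estimation, points can still power simple delivery forecasts, which is the “good” use mentioned earlier. A minimal sketch, with hypothetical sprint numbers:

```python
# Hypothetical velocity history: points completed in recent sprints.
past_velocity = [21, 18, 24, 20]

remaining_points = 130  # points left in the backlog

# Forecast from the team's own historical average. This only stays
# meaningful if velocity is never used to judge productivity, since
# inflated estimates would corrupt both numbers.
avg_velocity = sum(past_velocity) / len(past_velocity)
sprints_needed = -(-remaining_points // avg_velocity)  # ceiling division
print(f"~{int(sprints_needed)} sprints to clear the backlog")
```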
Impact is a new, proprietary metric offered by several prominent vendors in the engineering analytics space.
“Impact” is an evolved version of “lines of code” that factors in things like how many different files were changed and how much of the change was new code versus modifications to existing code. All these factors are combined to calculate what is called an “Impact” score for each developer or team.
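The vendors don’t publish their exact formulas, so here is a purely hypothetical sketch of what an “Impact”-style score looks like; the factors and weights are invented for illustration and don’t reproduce any real product:

```python
# A hypothetical "Impact"-style score. The weights below are invented
# to illustrate the shape of such metrics, not taken from any vendor.
def impact_score(lines_changed, files_touched, pct_new_code):
    """Blend several change-based factors into one opaque number."""
    return (lines_changed * 0.5
            + files_touched * 2.0
            + pct_new_code * 10.0)

# Two very different pieces of work produce similar-looking numbers,
# and a developer handed the score has no way to reason about why.
print(impact_score(lines_changed=45, files_touched=5, pct_new_code=1.0))
print(impact_score(lines_changed=60, files_touched=2, pct_new_code=0.15))
```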
I’ve observed many companies that have tried this metric, and developers almost always hate it. Not only does this metric suffer from the same flaws as Lines of code, but it’s really difficult to understand because it’s calculated using a number of factors.
Then there’s the naming of it. Calling a metric “Impact” sends a strong signal about how it should be used, particularly by managers. And this makes it very easy to misuse.
Do you remember the story about my dad? This is exactly the kind of stuff he was terrified about.
So to recap, these are the Flawed Five. These metrics attempt to measure output and tie it to productivity… and as I’ve outlined, these metrics don’t work.
But why is it that these five metrics are still so prevalent? Why is it that we keep using them despite their flaws?
If you recall the story I shared at the beginning of this talk… that frustration I had of not being able to measure how my team was doing… that was a strong emotion.
And as I’ve spoken to many people across the industry, I have come to see how widespread and powerful the desire to measure is.
I’ve met with leaders in charge of tens of thousands of engineers who tell me they are flying blind… the maintainers of some of the largest open source projects that have no way of seeing the health of their communities or impact of their work. And countless developers and managers who are looking for a way to measure their progress and improvement.
These are all scenarios where measurement would be incredibly useful. But there aren’t good solutions today.
And it’s this burning desire to measure… this desperation… that can lead us into the trap of measuring the wrong things or using metrics in the wrong ways.
So to help prevent ourselves from falling into these traps, we need to better understand ourselves. We need to be aware of what we’re trying to measure, and why.
And the “why” is really important. Understanding the “why” allows us to recognize our biases, avoid the trap of misusing metrics, and move toward better solutions.
I think that in almost all cases, our desire to measure is driven by four fundamental reasons… four “whys”.
Let’s go through them.
The first and most obvious reason we want to measure is to inform our actions and decisions.
For example, I was talking to an engineering director a few weeks ago who wanted a way to more evenly distribute code reviews across their team.
He wanted to create a dashboard that their team could use to see how reviews were being distributed.
All teams and organizations need ways of agreeing on what they’re trying to do.
<Story of director who wants to ask for more headcount, or $10mm investment in a new tool>
<We see this problem with GitHub itself>