On one hand, you have executives begging Engineering Managers, “Just give me some number. Any number.” On the other hand, you have Engineers with severe PTSD flashbacks anytime anyone says anything about metrics.
Measuring the productivity of software development teams is easy. Unless you want to do it well. In that case, it’s really, really hard.
But, it’s not impossible. In this post, we’ll explain why companies measure Software Engineering productivity, why it’s difficult, and how to measure the productivity of Software Development teams.
Why companies measure Software Engineering productivity
According to author Steve McConnell, organizations might want to measure how productive their Software Engineering teams are for several reasons, including:
- To develop competitive analyses and benchmarks
- To track progress over time
- To reward high performers
- To determine resource allocation
- To identify and spread more productive development processes across the organization
In addition, organizations are looking for measures to help them identify and motivate the behaviors that correlate with revenue. And Engineering teams want measures that can help them justify their investments.
Examples of Engineer productivity metrics
Here are some metrics that many organizations use to measure software development productivity:
- Lines of code per staff per month
- Function points per staff per month
- Story points per staff per month
- 360-degree peer evaluations
- Engineering leader evaluations
- Task-completion predictability
- Test cases passed
- Defect counts
- Cycle times
Abi Noda, Senior Product Manager at GitHub, often sees companies relying on the following metrics, which he calls the “flawed five”:
1. Commits
2. Lines of code
3. Pull requests
4. Velocity points
5. “Impact”
Why measuring Software Engineering productivity is so painful
Before you can start measuring productivity, you need to define it. “An awful lot of the issues that come up when measuring productivity actually come down to the real question of maybe not having a clear definition of what productivity is,” McConnell said. “Obviously it’s hard to measure something if we’re not sure what we’re measuring.”
Noda says when he asks people to define productivity, many of their answers come down to output. What are we creating? This definition leaves out the question of efficiency. If you commit lots and lots of resources, you’ll get more output, but at what cost?
If you want your business to succeed, a better definition of productivity is output/input. That is, divide your output by your input to get your ROI.
In this equation, your outputs should tie directly to revenue. But that’s easier said than done.
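To make the arithmetic concrete, here’s a minimal sketch in Python. The team names and figures are entirely made up, and expressing output in revenue terms is an assumption for illustration; as the rest of this section explains, quantifying “output” is the hard part.

```python
# Toy illustration of productivity as output divided by input.
# All figures are hypothetical: "output" stands in for revenue
# attributable to shipped work, "input" for fully loaded staff cost.
teams = {
    "Team A": {"output": 500_000, "input": 250_000},
    "Team B": {"output": 800_000, "input": 650_000},
}

for name, figures in teams.items():
    productivity = figures["output"] / figures["input"]
    print(f"{name}: {productivity:.2f} units of output per unit of input")

# Team B produces more in absolute terms, but Team A gets a better
# return on every unit invested -- the distinction drawn above.
```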
1. Many metrics correlate weakly with revenue
As Noda puts it, “Output cannot be measured accurately.” Any good measure of output should correlate strongly with revenue. But many developer productivity metrics correlate only weakly with revenue. Take “lines of code.” Not only does it correlate weakly with revenue, it doesn’t even correspond closely to product functionality.
According to McConnell, two Engineers can build the exact same feature with the same functionality and their code volume can vary by a factor of ten. Plus, the more code you have, the more expensive it is to keep it working.
Function points are also only loosely tied to revenue. You’ll often have a ton of function points that don’t positively impact revenue. Some projects never launch. Others launch and flop in the marketplace.
Velocity points are also poorly correlated with revenue.
2. Many metrics are hard to tie to individual contributors
Other measurable outputs are easier to tie to revenue but harder to attribute to a particular team, let alone to individual team members’ performance.
These include:
- Bug fixes
- Closed change requests
- Uptime/SLA
- Support for company strategy
Many inputs are also hard to measure at the individual level. These can include:
- Technical staff hours
- Staff hours defining requirements, etc.
- Hardware
- Technical debt
For these reasons and more, Noda recommends organizations measure team productivity but not individual Software Engineer productivity.
3. Many measures are weaker signals than Engineer variability
It’s hard to determine how practices like pair programming affect productivity. McConnell points to research showing that many, if not most, of these effects are swamped by the difference in productivity between individual Engineers, which is 20 to 1 on average. That is to say, the highest-performing Engineer is, on average, 20x more productive than the lowest-performing Engineer.
4. Many metrics are easy to game
Some metrics end up creating perverse incentives. Let’s take average code review turnaround time as an example. It’s the time between when someone requests a review and when the reviewer responds. Noda recommends this measure.
However, the downside is that measuring time-to-review in isolation encourages reviewers to be less thorough in order to improve their stats. That’s exactly what Noda found his reviewers doing after he implemented this measure. He also found staffers avoiding code review requests on Fridays because the software they used for measurement didn’t take weekends into account.
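The weekend blind spot is fixable. Here’s a minimal sketch of a turnaround-time calculation that subtracts weekend days. The timestamps are made up, and it deliberately ignores holidays, time zones, and partial-day nuances; this isn’t the tool Noda’s team used, just an illustration of the idea.

```python
from datetime import datetime, timedelta

def turnaround_excluding_weekends(requested: datetime, responded: datetime) -> timedelta:
    """Time from review request to reviewer response, skipping weekends.

    A deliberately simple sketch: it removes whole Saturday/Sunday
    days and ignores holidays and time zones.
    """
    elapsed = responded - requested
    weekend_days = 0
    day = requested.date()
    while day <= responded.date():
        if day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            weekend_days += 1
        day += timedelta(days=1)
    return elapsed - timedelta(days=weekend_days)

# A review requested Friday afternoon and answered Monday morning
# counts as hours of turnaround, not three days.
requested = datetime(2023, 3, 3, 16, 0)   # Friday, 4pm
responded = datetime(2023, 3, 6, 10, 0)   # Monday, 10am
print(turnaround_excluding_weekends(requested, responded))  # 18:00:00
```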
To game pull request counts, you can make smaller changes that require less work.
To game velocity points, you can inflate the number of points a task will require to complete. This benefits you while making your estimates less accurate.
5. Some metrics penalize Engineers for their ambition
A metric that incentivizes Engineers to be less productive is obviously flawed. Using pull request counts effectively punishes Engineers for tackling large, hairy problems. As does using points per sprint.
Identifying the right Engineering productivity metrics
According to McConnell, a good measure of Engineering productivity has the following attributes:
- Truly reflects productivity as defined above (i.e., correlates closely with revenue)
- Includes all work output
- Incorporates non-Engineering work (testers)
- Gaming-resistant
- Objective and independently verifiable
- Language-agnostic
- Can be compared across projects
- Doesn’t penalize giving the best people the hardest assignments
- Easy and cheap to measure
Noda recommends teams measure processes as opposed to outputs. He claims that when you improve processes, you increase productivity. Process metrics include the following (a couple of them are sketched in code after this list):
- Code review turnaround time
- Pull request size
- Work-in-progress
- Time-to-open
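As a rough sketch of what computing a couple of these might look like, here’s some Python over hypothetical pull request records. The field names and data are invented; in practice you’d pull them from your version-control or code-review system’s API.

```python
from datetime import datetime
from statistics import mean

# Hypothetical pull request records. Real data would come from a
# version-control or code-review system.
pull_requests = [
    {
        "review_requested": datetime(2023, 3, 1, 9, 30),
        "first_review": datetime(2023, 3, 1, 14, 0),
        "lines_changed": 120,
    },
    {
        "review_requested": datetime(2023, 3, 2, 11, 15),
        "first_review": datetime(2023, 3, 3, 10, 0),
        "lines_changed": 640,
    },
]

# Code review turnaround time: request -> first response, in hours.
turnarounds = [
    (pr["first_review"] - pr["review_requested"]).total_seconds() / 3600
    for pr in pull_requests
]

# Pull request size: lines changed per PR.
sizes = [pr["lines_changed"] for pr in pull_requests]

print(f"Avg review turnaround: {mean(turnarounds):.1f} hours")
print(f"Avg PR size: {mean(sizes):.0f} lines changed")
```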
Noda says developers care about these metrics. It seems likely that, generally speaking, processes that make developers happier will also make them more productive. However, you can easily imagine cases where that’s not true. The obvious criticism: without measuring productivity itself, how do you know what counts as an improvement rather than just a change?
He also suggests that every team is unique. Each team should set its own targets for each of these process measures and then be evaluated on its performance relative to those targets. One team might aim for a 24-hour code review turnaround, while another might aim for 36 hours. This reduces the temptation to juke the stats beyond the goal, and it levels the playing field between teams.
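A minimal sketch of target-relative evaluation, with invented team names, targets, and actuals:

```python
# Judge each team against its own target rather than a single
# org-wide number. All targets and actuals (hours of code review
# turnaround) are hypothetical.
teams = {
    "Checkout": {"target": 24.0, "actual": 21.5},
    "Platform": {"target": 36.0, "actual": 40.0},
}

for name, t in teams.items():
    status = "meeting target" if t["actual"] <= t["target"] else "missing target"
    print(f"{name}: {t['actual']}h vs {t['target']}h target ({status})")
```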
Every measure is imperfect. To get the most accurate picture, it’s best to use multiple metrics so that their individual pros and cons balance each other out.
McConnell recommends scoring every metric you’re considering from 1 to 5 (with 1 being terrible and 5 being excellent) based on how well it meets the criteria you care about.
For example, you might give pull request counts a low score for correlation with revenue but a high score for helping teams understand release cadence and continuous delivery.
Manager evaluations might get a high score for cost-to-measure, since you should be doing these evaluations anyway, but a low score for capturing the work of non-programmers.
Once you’ve scored all your metrics, you can pick the highest-rated measures and create a scorecard. Multiple metrics are harder to game than any single metric in isolation: if you over-optimize for one measure, you’re likely to see decreases in others. These scorecards can be used across projects and can be made public and reviewed. The downsides are that tracking multiple metrics is more time-consuming than tracking one or two, and the scores are harder to independently verify.
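Here’s what that scoring exercise might look like in code. The candidate metrics, criteria names, and 1-to-5 scores below are all invented for illustration:

```python
# Score each candidate metric from 1 (terrible) to 5 (excellent)
# against the criteria you care about, then keep the top scorers.
# All scores below are hypothetical.
criteria = ["revenue_correlation", "gaming_resistance", "cost_to_measure"]

candidate_metrics = {
    "pull_request_count": {"revenue_correlation": 1, "gaming_resistance": 2, "cost_to_measure": 5},
    "manager_evaluation": {"revenue_correlation": 3, "gaming_resistance": 4, "cost_to_measure": 5},
    "review_turnaround":  {"revenue_correlation": 2, "gaming_resistance": 3, "cost_to_measure": 4},
}

# Rank metrics by their total score across all criteria.
ranked = sorted(
    candidate_metrics.items(),
    key=lambda item: sum(item[1][c] for c in criteria),
    reverse=True,
)

# Take the top scorers as your scorecard.
scorecard = [name for name, _ in ranked[:2]]
print("Scorecard metrics:", scorecard)
```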
Going forward
Most organizations want to measure Engineering productivity because doing so effectively leads to more revenue. For this reason, organizations have developed many common metrics for measuring Software Engineering productivity. Unfortunately, most of these metrics are deeply flawed. Most metrics correlate only loosely with revenue. And the metrics that tie more closely to revenue are hard to attribute to specific individuals or teams. Plus, many metrics have unintended consequences.
Your best bet is to evaluate each potential metric based on its pros and cons. Then take your highest-rated metrics and create a scorecard out of those metrics that you can use to evaluate Engineering teams’ productivity.