October 5, 2015
How to gain 98% Improvement in capacity without hiring a single person
Most organisations I work in think that capacity is a function of how many people you have multiplied by the average utilisation of staff. In software development this is horribly wrong, and it leads to large, bloated IT departments that produce very little value.
This incorrect belief leads ‘resource’ managers* to focus on increasing utilisation and hiring more staff, which is the exact opposite of what is really needed to increase the amount of value produced. Following this path makes organisations weaker, less effective, more costly, and less competitive.
Managers think this because of an incorrect belief that the problems in software development are merely complicated and not complex. The difference is explained in my post on Cynefin. This leads to the further incorrect belief that the relationship between capacity and number of staff multiplied by utilisation is linear. It is not. And what’s worse, it is an inverse relationship.
Understanding the real relationship between capacity, utilisation and the value created is absolutely key to understanding how to scale software development and really increase the amount of value produced (which is what we are trying to do with large Agile adoptions).
The hidden factor not being considered is the queue time between people and teams, along with the co-ordination and alignment across teams. Queues are all the places where unfinished (and, by Agile definition, un-valuable) work hides in your organisation. In this article, we are most interested in the queue time between teams and the chaos that this causes.
Queues between teams are made visible when every team has its own backlog. Whilst this is a scaling Agile anti-pattern, it will serve as a way to highlight why queues are so important to the overall capacity of the organisation.
Let’s look at a typical sequential product development process.
Source: 2 day scaling agile and organisational design course, based on Craig Larman’s talk at AWA
This is clearly not an Agile process, but it shows how the problems are between the teams and not in the teams. At first glance, when you imagine a single piece of work traveling through the system nothing seems wrong. The problem starts when multiple things go through the system at once.
All IT work is non-homogeneous; that is to say, there is high variability in complexity, technical risk and the time taken to complete items. Different teams take different times to complete different parts of any one feature travelling through the system.
Most organisations optimise systems around maximum utilisation of staff. By the very existence of such a set-up, teams are forced to work on the highest priority item in their own list, which from a system perspective may be a low priority. When a high priority item for one team depends on work from another team, that work may not get done, because the other team has no visibility of the overall system priority. The high priority work is then delayed, waiting for the other team to complete its low priority work.
First priority item that component team 7 can work on, is number 10 in the list.
Source: 2 day scaling agile and organisational design course, based on Craig Larman’s talk at AWA
This has two effects: first, teams are working on low priority items, and second, high priority items are delayed waiting for the low priority items.
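As a rough illustration of the first effect, here is a minimal sketch in Python. The feature list, the team names and the `first_item_per_team` helper are all made up for the example; the point is only that a team kept fully busy will start on whatever is first in its own view of the list, however low that sits in the overall priority order.

```python
# Hypothetical system-wide priority list: rank 1 is the most valuable item.
# Each item needs work from one or more component teams (names are invented).
features = {
    1: ["ui", "api"],
    2: ["api", "db"],
    3: ["ui"],
    4: ["api"],
    5: ["db", "ui"],
    6: ["api"],
    7: ["ui", "db"],
    8: ["api"],
    9: ["db"],
    10: ["ui", "reporting"],
}


def first_item_per_team(features):
    """For each team, return the best-ranked item that needs that team."""
    first = {}
    for rank in sorted(features):
        for team in features[rank]:
            # Only remember the first (best-ranked) item seen for each team.
            first.setdefault(team, rank)
    return first


for team, rank in first_item_per_team(features).items():
    print(f"{team}: first item it can work on is number {rank} in the list")
```

In this made-up list, the reporting team has nothing to work on until item 10, so keeping it fully utilised means starting item 10 while items 1 to 9 are still in progress elsewhere.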
In some cases, the wait times between teams caused by splitting work and re-assembling it again can be as long as 18 months, or at least 98% of the total time the items spend in the system.
The more teams you have, the more this problem comes into effect.
The entire system is de-optimised, with huge wait times, additional people doing nothing more than splitting work, others sticking it back together again, and still others trying to co-ordinate and manage dependencies. As you add more people and teams to this web of interactions, the queues multiply exponentially and the complexity and wait times increase geometrically.
As a simple exercise, if you find yourself presiding over a system similar to this, record the time at which items were first created at the very start of the process (say in the business analysis team). Then, for each item, record the time when it was finished as working software being used by customers, ideally with feedback attached. The difference is your cycle time per item.
What is interesting in this experiment is partly that the length of time is so large, but more interesting is the variability of cycle times. The larger the system and the batch size of work, the more variable the cycle times. This maps directly to how unpredictable your delivery dates are.
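The exercise above can be sketched in a few lines of Python. The item names and dates below are invented purely for illustration, and `cycle_times_days` is a hypothetical helper, not part of any tool; in practice the two timestamps would come from wherever you track work.

```python
from datetime import date
from statistics import mean, pstdev

# Invented records: (item, first created, in use by customers).
records = [
    ("A", date(2015, 1, 1), date(2015, 1, 31)),
    ("B", date(2015, 1, 1), date(2015, 4, 11)),
    ("C", date(2015, 2, 1), date(2015, 2, 11)),
]


def cycle_times_days(records):
    """Cycle time per item: days from first creation to use by customers."""
    return {name: (done - created).days for name, created, done in records}


times = cycle_times_days(records)
values = list(times.values())

# The average tells you how long things take; the spread tells you how
# predictable delivery is.
spread = max(values) - min(values)

print("cycle times (days):", times)
print(f"average: {mean(values):.1f} days")
print(f"spread (longest minus shortest): {spread} days")
print(f"standard deviation: {pstdev(values):.1f} days")
```

The spread and standard deviation are the numbers to watch: a large average is painful, but a large spread is what makes forecasting unreliable.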
In Agile prioritisation terms, we are saying you have a totally variable (partly random) cost of delay for features within your system. That means you have no way of prioritising the cost or value of any item in this type of system with any accuracy.
If you are using cost of delay (CoD) as a prioritisation technique, a process like this makes a total mockery of it, because the order in which things come out of the system is totally random compared to the order in which they went in.
This variability is inherent in software creation. Any prioritisation at a system level is overridden by local prioritisation of work in each team’s backlog. It becomes impossible to co-ordinate priorities between teams without bartering and loudest-person-first types of interactions, with POs from one team trying desperately to convince POs from another team to prioritise some dependent item in their backlog so work can continue in some linear fashion. The number of dependencies and their complex nature is made much worse by the fact that every piece of work going through the system is of different duration and technical difficulty, resulting in even more unpredictability and an impossible co-ordination task.
If you still have PMs (who are needed in this type of sequential process), feel very sorry for them, because their job of keeping things on track is actually impossible.
The length of the queues (backlogs) for each team is key here, because the longer it takes for items to get pushed through the queue, the longer dependent teams have to wait for their critical item to be ready. This increases the variability of cycle time (bad thing).
Queue length rises steeply and non-linearly with the utilisation of the people in the team, and the ability to expedite items through the system falls correspondingly.
Source: Copyright: Don Reinertsen – 2 day course and evening event at AWA on 2nd Generation Product delivery.
The formula above is: queue length L = utilisation squared divided by (1 minus utilisation).
We can see that the higher the utilisation, the more steeply the queue size rises, growing without limit as utilisation approaches 100%. The higher the queue size, the more WIP, delay, and usually task switching. This makes co-ordination of dependencies much harder, and therefore a lot of work sits unfinished, waiting for other teams to get to the dependent piece. All that value is sitting there, hidden in code repositories, rotting away, waiting for other teams to finish their bit, which is itself waiting in a large queue.
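To get a feel for the numbers, here is a small Python sketch that simply evaluates the formula as stated, with utilisation expressed as a fraction between 0 and 1. The function name `queue_length` is my own, and the sketch assumes the formula holds as given (it has the same form as the standard result for a simple single-server queue); real teams will not follow it exactly.

```python
def queue_length(utilisation: float) -> float:
    """Average queue length: utilisation squared / (1 - utilisation)."""
    if not 0 <= utilisation < 1:
        # At 100% utilisation the formula divides by zero: the queue grows
        # without limit.
        raise ValueError("utilisation must be at least 0 and below 1")
    return utilisation ** 2 / (1 - utilisation)


for u in (0.5, 0.75, 0.8, 0.9, 0.95, 0.99):
    print(f"{u:.0%} utilised: about {queue_length(u):.1f} items queued")
```

Under this formula a team at 50% utilisation has about half an item waiting on average, at 90% it has about 8, and at 99% it has about 98. The last few percentage points of utilisation are by far the most expensive.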
Trying to increase utilisation of staff within a team to get through the work quicker is exactly the opposite of what is needed. This only increases queue length and dependency problems. Looking at the graph, we can see that utilisation is optimal around 75-80% which is consistent with Tom DeMarco’s findings in his book Slack.
We can see from the equation that the relationship between utilisation and queue length is far from linear: each extra point of utilisation adds more to the queue than the last. Focusing on efficiency (utilisation) within the team hugely decreases overall flow and hence reduces the capacity of the system as a whole.
There are two solutions to this problem to unwind the cycle of more people and higher utilisation resulting in less value and higher wait times.
The first is largely unpopular, but is the best solution. The answer is to provide a single backlog for the teams: a product backlog. Each team is a generalising specialist team and can therefore work on any item in that backlog. Each item is a slice of value through the system. Teams are largely independent of each other, having the entire range of skills needed to produce value. Each team has its own view into the backlog, and so doesn’t have to view the entire thing, but in effect they are all working from the one single backlog. In this solution, we are removing the queues altogether. This is what the Large Scale Scrum framework provides.
Queues are removed because each team is not dependent (or in practice, hardly dependent) on any other teams for the entire feature.
The second solution is even less popular. We would need to actively manage utilisation. We could schedule work from the backlog for only, say, 50% of the available time of all staff. Lowering the utilisation would allow bottlenecks to be removed, as teams could ask others to finish their critical items and those others would have the space to do it. This would reduce the overall work in the system but allow valuable items to get through the system much faster. It would reduce blocked dependencies significantly. This solution is rarely used because it challenges underlying beliefs too much and is non-intuitive to those who don’t have a grasp of systems level thinking.
In conclusion, to increase capacity, you must create a single product backlog and restructure teams to remove queues. Each team must be able to deliver full features end to end. To do this, you need generalising specialists within the team, the ability to work on code in a shared ownership model, and full continuous integration, testing and deployment. You must get rid of the design function in separate architecture and BA teams (get rid of the whole concept of silo teams). All those people now sit within the delivery teams, and everyone works on items from the backlog.
Cost of delay calculations now work, because items are being delivered in the order in which they are added to the backlog. Everyone is working on the top priority work.
You don’t have to hire anyone new or try to increase the utilisation of teams or individuals to gain a 98% improvement in throughput. Instead, reduce utilisation! You might even find you have too many people…
* Resource Management / Human Resources – I am a big fan of not calling people resources. I prefer people operations to human resources. “I am not a number, I am a free man!”.