Saturday, 23 April 2005

Managing the Pipeline

“I used to have everyone I needed in my business unit,” she told me. “But then we reorganized in the interest of efficiency, and all of my software people were moved into a central group. Before the reorganization, I had a great software architect who helped design all of our new products. Today we’re supposed to figure out exactly what we want before we get a project approved. Then the IT people assigned to my projects don’t know anything about my business, so the team spins its wheels for a long time before it gets traction. Efficiency? What a joke.”

A couple years earlier, this rapidly growing young company (call it XYZ) saw its revenue flattening and had regulators questioning its processes. In an attempt to impose discipline and cut costs, XYZ centralized software development. Executives felt that this would give them more visibility into the development portfolio and ensure that a standard process was used throughout the organization. They also hoped to even out the software development workload, increase utilization of resources, and eventually be able to outsource a good deal of development.

The problem was, XYZ’s products were largely based on software. When the software developers left the business units, the company's track record for fielding successful new products plummeted. Time-to-market stretched out for even the most important initiatives, and market share dropped. What went wrong?

People or Resources?
Company XYZ did very well when new products were designed and implemented inside a business unit. When a key part of the product development team, the people developing the software, was removed from the business unit, many important features of product development were lost. For example, funding was no longer incremental, based on stages and gates. Complete funding for product development had to be justified and allocated in order to get software development resources assigned. In turn, a complete specification and spending estimate was required before work could begin. Software architects were no longer involved at the fuzzy front end of new product development, and the discovery loops that used to be part of product development were no longer acceptable.

Even as feedback loops were removed from product development, they became more important than ever, because the software development people on the product team were usually not familiar with the business, and worse, they were not even familiar with each other. In places where tacit knowledge and team cohesiveness are important, treating people as interchangeable resources just doesn’t work.

Scheduling
Prior to the reorganization, when software people were embedded in divisions, some people were regularly assigned to several projects at a time, while others appeared to be less than fully utilized. The company decided that it would be more effective to assign individuals to only one project at a time, and if possible, the projects should be small. Unfortunately, this created a scheduling nightmare, so XYZ invested in a computerized resource scheduling system to help sort out the complex resource assignments.

Computerized scheduling systems have a well-known problem: they do not accommodate variation. XYZ discovered that if projects didn’t end when they were scheduled to end, the system’s assumptions were invalid, so the system’s resource assignments were often out of touch with reality. The company tried to fix such problems by keeping some teams intact and by holding weekly management meetings to arbitrate the conflicts between the computer’s schedule and reality. In practice, the overhead of management intervention and idle workers waiting for teams to assemble outweighed any efficiencies the system generated.

Company XYZ also tried to reduce the variability of project completion by urging teams to make reliable estimates and rewarding project managers who delivered on schedule. Unfortunately, such attempts to reduce variability generally don’t work. The reason for this becomes clear with a quick look at the theory of variation.

Variation
W. Edwards Deming[1] first popularized the theory of variation, which is now a cornerstone of Six Sigma programs. Deming taught that there are two kinds of variation: common variation and special variation. Common variation is inherent in the system; special variation is something that can be discovered and corrected. Common variation can be measured, and control charts can be used to keep the system within the predicted tolerances. But it is not possible for even the most dedicated workers to reduce common variation; the only way to reduce common variation is to change the system. And here’s the important point: Deming felt that most variation (95% or more)[2] is common variation, especially in systems where people are involved.
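
To make this concrete, here is a minimal sketch of the control-chart arithmetic, applied to made-up project cycle times; the 2.66 factor is the standard constant for an individuals (XmR) chart. Points inside the computed limits are common variation, noise inherent in the system; only points outside them hint at a special cause worth investigating.

```python
# A minimal sketch of an individuals (XmR) control chart on made-up cycle times.
# Points inside the limits are common variation; points outside suggest a special cause.

cycle_times = [12, 15, 11, 14, 35, 13, 12, 16, 14, 13]  # days per release (hypothetical)

mean = sum(cycle_times) / len(cycle_times)
moving_ranges = [abs(b - a) for a, b in zip(cycle_times, cycle_times[1:])]
avg_mr = sum(moving_ranges) / len(moving_ranges)

upper = mean + 2.66 * avg_mr            # upper natural process limit
lower = max(0.0, mean - 2.66 * avg_mr)  # lower limit, floored at zero

for days in cycle_times:
    verdict = "special cause?" if days > upper or days < lower else "common variation"
    print(f"{days:5.1f} days  {verdict}")

print(f"limits: {lower:.1f} to {upper:.1f} days; only changing the system moves these")
```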

The other kind of variation is special variation, which is variation that can be attributed to a cause. Once the cause is determined, action can be taken to remove it. But there is danger here: “tampering” is taking action to remove common variation based on the mistaken belief that it is special variation. Deming insisted that tampering creates more problems than it fixes.

In summary: The overwhelming majority of variation is inherent in a system, and trying to remove that variation without changing the system only makes things worse. We can assume that most of the variation in project completion dates is common variation, but since computerized scheduling systems are deterministic, they can’t really deal with any variation. The bottom line: a computerized scheduling system will almost never work at the level of detail at which XYZ was trying to use it. Exhorting workers to estimate more carefully and project managers to be more diligent in meeting deadlines is not going to remove variation from projects. We need to change the rules of the game.

We know that estimates for large systems and for distant timeframes have a wide margin of uncertainty, made wider if the development team is an unknown. We should stop trying to change that; it is inherent in the system. If we want reliable estimates, we need to reduce the size of the work package being estimated and limit the estimate to the near future. Furthermore, estimates will be more accurate if the team implementing the system already exists, is familiar with the domain and technology, makes its own estimates covering a short period of time, and updates these estimates based on feedback. The good news is, once such a team establishes a track record, its variability can be measured and predicted.
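
As a rough illustration of relying on a track record rather than exhortation, here is a sketch that turns a team’s iteration history into a forecast range; the velocity history and backlog size are invented for the example.

```python
# A sketch of estimating from a team's own track record rather than exhortation.
# The velocity history and backlog size are hypothetical.
import statistics

velocity_history = [21, 24, 19, 23, 22, 20, 25, 21]  # points finished per two-week iteration

mean = statistics.mean(velocity_history)
spread = statistics.stdev(velocity_history)

backlog = 130  # points of near-term, well-understood work (assumed)

# Forecast a range, not a single date, based on the measured variability.
best_case = backlog / (mean + 2 * spread)
worst_case = backlog / (mean - 2 * spread)
print(f"velocity {mean:.1f} +/- {spread:.1f} points per iteration")
print(f"expect roughly {best_case:.1f} to {worst_case:.1f} iterations for {backlog} points")
```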

Utilization
Unfortunately, Company XYZ believed that efficiency would be improved by increasing resource utilization. Trying to maximize utilization can have serious unexpected side effects, not the least of which is decreased efficiency and reduced utilization. If this seems odd, think about how efficient our highways are during rush hour. Most systems behave like traffic systems; as utilization of resources passes a critical point, non-linear effects take over, and everything slows to a crawl. Even the most brilliant scheduling system cannot prevent delays if you insist on 100% utilization.
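
The traffic analogy can be made precise with the textbook single-server queue from queuing theory (not XYZ’s data): average waiting time grows in proportion to utilization / (1 - utilization), so it explodes as utilization approaches 100%. A minimal sketch:

```python
# A sketch of the textbook M/M/1 queue: time spent waiting grows non-linearly
# as utilization approaches 100%. Service time is one arbitrary "work unit".

service_time = 1.0  # average time to complete one piece of work

for utilization in (0.50, 0.70, 0.80, 0.90, 0.95, 0.99):
    # Average wait in queue = utilization / (1 - utilization) service times
    wait = utilization / (1.0 - utilization) * service_time
    print(f"utilization {utilization:4.0%}: average wait = {wait:5.1f} work units")
```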

When a computer operations manager looks at the utilization history of her equipment, she would never say: “Look at that – we’re only using 80% of our server capacity and 85% of our SANs. Let’s use them more efficiently!” She knows that such high utilization is a warning that the systems are operating on the edge of their capacity, and even now response times are probably slowing down.

But when a development manager takes a look at the utilization history of his department, he will often say: “Look at that – we are only using 95% of our available hours. We have enough free time to add another project!” At the same time, he is probably asking himself, “I wonder why I’m getting all these complaints about our response time?” And all too often his solution is, “We’ll just have to set more aggressive deadlines.”

Response Time
Consider the release manager of a software product. Assume she has a service level agreement which calls for critical defects to be found and patched in four hours, serious defects to be found and patched in 48 hours, and normal defects to be fixed in the next monthly release. You can be sure that her primary measurement is response time, and that she adjusts staffing until the service level is achieved. Because of this, there will always be people available to attack defects, and occasionally people may have a bit of spare time.

In one of my classes, two teams did value stream maps for almost the same problem – deliver on a feature request which would take about 2 hours of coding. One team documented an average response time of 9 hours to deployment, the other team documented an average response time of 32 days. In the first case, the policy was: “When a request is approved, there will always be someone available to work on it.” In the second case, the request got stuck twice in two-week-long queues waiting for resources. The interesting thing is that the first organization actually did more work with fewer people, because they did not have to manage queues, customer queries, change requests and the like. They were more efficient despite, or perhaps because of, a focus on response time rather than resource utilization.

The bottom line is that managing response time, or time-to-market, is more efficient and more profitable than managing utilization. You need some slack to keep development and innovation flowing. As any good operations manager already knows, when work flows rapidly and reliably through an organization, its efficiency and utilization will be higher than in an organization jammed up with too much work.

Rules of the Game
Queuing theory gives us six rules for reducing software development cycle time:
  1. Limit work to capacity
  2. Even out the arrival of work
  3. Minimize the number of Things-in-Process
  4. Minimize the size of the Things-in-Process
  5. Establish a regular cadence
  6. Use pull scheduling
1. Limit work to capacity
The biggest favor you can do for your organization is not to accept any more work than it can handle. Of course, to do this, you have to know the capacity of your organization. One way to estimate the capacity is to look at output. If you currently complete one large system a year, deliver about three services a quarter, and respond to about seven change requests per week, this is a rough approximation of your capacity, and a good limit on the amount of new work you should accept.

Next you might calculate how much work you have already accepted. In one of my classes, an executive did the math and discovered that he had seven years’ worth of work in a queue that was reviewed every week. He decided that he could toss out all but a few months of work; the rest would never get done, but it was consuming a lot of time.

2. Even out the arrival of work
At Company XYZ, one of the scheduling headaches was caused by a huge workload during the first six months of the year, and a relatively low demand for the second half of the year. At first this puzzled me, because the company’s business was not seasonal, so there seemed to be no reason for the uneven demand. I suspected that a sub-optimizing measurement somewhere might be the cause. When I asked if the annual budgeting cycle or executive performance measurement system might be driving uneven demand, my suspicions were confirmed. I recommended that the organization work to change the measurement system, rather than accommodate it.

3. Minimize the number of Things-in-Process
One of the basic laws of queuing theory is Little’s Law[3]:

    Average Response Time = Number of Things-in-Process / Average Completion Rate

According to Little’s Law there are two ways to improve response time: you can spend money to improve the Average Completion Rate, or you can apply intellectual fortitude to reduce the number of Things-in-Process. For example, assume you can respond to about six feature requests per month. If twelve requests are released for work, they will take an average of two months to complete. If, however, only three requests are released at a time, it will take an average of two weeks to respond to a feature request.
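
The arithmetic in that example is just Little’s Law applied directly; a few lines make it explicit, using the completion rate of six requests per month from the text:

```python
# Little's Law with the numbers from the text:
# average response time = things-in-process / average completion rate.

completion_rate = 6.0  # feature requests completed per month

for things_in_process in (12, 3):
    months = things_in_process / completion_rate
    print(f"{things_in_process:2d} requests in process -> "
          f"{months:.1f} months (about {months * 4:.0f} weeks) average response time")
```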

4. Minimize the size of the Things-in-Process
We’ve already noted the effect of high utilization on cycle time; we should also note that as batch size increases, the effect is much more pronounced,[4] as the graph below illustrates:

[Graph: cycle time versus utilization, rising sharply as utilization nears 100%, and rising far sooner for large batches than for small ones.]

So if you want high utilization, you should develop in very small batches. For example, you will get much faster throughput and higher utilization if you develop ten services one at a time, rather than developing all ten at the same time.
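
Here is a sketch of that claim with invented numbers: ten services, each needing one team-month of effort, and one team’s worth of capacity. Working serially, value starts shipping after the first month; working on everything at once, nothing ships until month ten, and that is before counting the overhead of task switching.

```python
# Ten services, one team-month of effort each, one team's worth of capacity.
# Compare working on them one at a time with working on all ten in parallel.

n_services = 10
effort_each = 1.0  # team-months per service (illustrative)

# One at a time: the k-th service ships at the end of month k.
serial_finish = [effort_each * (k + 1) for k in range(n_services)]

# All at once: capacity is spread thinly, so nothing ships until the end
# (and this ignores the extra cost of task switching, which makes it worse).
parallel_finish = [effort_each * n_services] * n_services

print(f"one at a time: first delivery month {serial_finish[0]:.0f}, "
      f"average delivery month {sum(serial_finish) / n_services:.1f}")
print(f"all at once:   first delivery month {parallel_finish[0]:.0f}, "
      f"average delivery month {sum(parallel_finish) / n_services:.1f}")
```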

5. Establish a regular cadence
In a lean factory, every process runs at a regular cadence called ‘takt time.’ If you want to produce 80 cars in 8 hours, you produce 10 cars per hour, so one car rolls off the line every 6 minutes. In software development the recommended practice is to establish an iteration cadence of perhaps two weeks or a month, and deliver small batches of deployment-ready software every iteration.

A regular cadence, or ‘heartbeat,’ establishes the capability of a team to reliably deliver working software at a dependable velocity. An organization that delivers at a regular cadence has established its process capability and can easily measure its capacity.

A regular cadence also gives inter-dependent teams synchronization points that they can depend on. Synchronization points are good places to get customer feedback, they are useful for coordinating the work across multiple feature teams, and they can help decouple hardware development from software development.

6. Use pull scheduling
Once both batch and queue size have been reduced and a cadence has been established, pull scheduling is the best method to compensate for variation and limit work to capacity. At the beginning of an iteration, the team ‘pulls’ work from a prioritized queue. They pull only the amount of work that they have demonstrated they can complete in an iteration. When a team is first formed or the project is new, it may take a couple of iterations for the team to establish its ‘velocity’ (the amount of work it can complete in an iteration). But once the team hits its stride, it can reliably estimate how much work can be done in an iteration and that is the amount of work it pulls from the queue.
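
A minimal sketch of the pull step at the start of an iteration, with a hypothetical backlog and a demonstrated velocity of 20 points: the team takes items from the front of the prioritized queue until the next item would exceed its velocity, and leaves the rest in the queue.

```python
# A sketch of the pull step at the start of an iteration.
# The backlog and demonstrated velocity are hypothetical.

backlog = [  # (feature, estimated points), highest priority first
    ("export report", 5), ("single sign-on", 8), ("audit trail", 3),
    ("bulk upload", 8), ("search filters", 5), ("archive old data", 13),
]
demonstrated_velocity = 20  # points per iteration, from the team's track record

iteration_plan, committed = [], 0
while backlog and committed + backlog[0][1] <= demonstrated_velocity:
    feature, points = backlog.pop(0)  # pull from the front of the prioritized queue
    iteration_plan.append(feature)
    committed += points

print(f"pulled this iteration ({committed} points):", iteration_plan)
print("left in the queue:", [feature for feature, _ in backlog])
```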

There are other points where queues might be established: there could be a queue of proposed work that needs a ROM (rough order of magnitude) estimate, or a queue of work awaiting a preliminary architecture assessment. Note that these queues should be short, and a team should not pull work from a queue until it has available time to do the work.

A pull system assures that everyone always has something to do (unless a queue is empty), but no one is overloaded. The development process is managed by managing the queues; management intervention is accomplished by changing the priority or contents of the queues. The cadence should be fast enough that changes can wait until the next iteration, so that changes are accommodated at the cadence of the process.

Conclusion
Development teams can do a lot to control their own destiny. They can make sure they have the right information, the necessary skills, and the appropriate processes to do a good job. But some things that impact the performance of the development team are outside of their control. Managing the pipeline is one of those things. If a development organization is swamped with work, no amount of good intentions or good process can overcome the laws of physics. If deterministic rules are applied to an inherently variable system, no amount of exhortation, reward, or punishment can make the system work. When the rules of the game have to change, the six rules for reducing cycle time are a good place to start.

References
[1] W. Edwards Deming (1900–1993), thought leader of the quality movement in Japan and the US.

[2] W. Edwards Deming, The New Economics, Second Edition, MIT Press, 2000, p. 38.

[3] Usually the numerator is WIP (Work-in-Process). The term Things-in-Process comes from Michael L. George and Stephen A. Wilson, Conquering Complexity in Your Business, McGraw-Hill, 2004, p. 37.

[4] Wallace Hopp and Mark Spearman, Factory Physics, McGraw-Hill, 2000.

Disclaimer: XYZ Company is not a real company, it is an amalgamation of companies I have worked with.

Thursday, 21 April 2005

Breaking The Quality–Speed Compromise

In the hotly contested commodity business of assembling computers, Dell enjoys a 50% cost advantage over its competitors.[1] This commanding advantage comes from Dell’s exceptional responsiveness to customers, flawless operations, and remarkable speed of execution. Conventional wisdom once held that the low cost producer could not provide customized high quality products. But Dell decided that its customers could have it all – low cost computers with the latest technology, built to order and delivered in a week.

In the book Hardball, George Stalk notes that when an industry imposes a compromise on its customers, the company that breaks the compromise stands to gain a significant competitive advantage.[2] For example, the airline industry imposes a big compromise on travelers: if you want low cost tickets, you have to make your plans early and pay a stiff penalty to change them. Southwest Airlines breaks this compromise: its customers can apply the cost of unused tickets to a future flight without a change fee.

In the software development industry, we impose many compromises on our customers. We tell them that high quality software takes a lot of time; we ask them to decide exactly what they want when they don’t really know; we make it clear that changes late in the development process will be very expensive. There’s a significant competitive advantage waiting for companies that can break these compromises. In particular, I’d like to focus on breaking the compromise between quality and speed, because many companies have achieved great leverage by competing on the basis of time.

When I teach classes on Lean Software Development, the first thing we do is draw value stream maps of existing software development processes. Starting with a customer request, the class draws each step that the request goes through as it is turned into deployed software which solves the customer’s problem. The average time for each step is noted, as well as the time between steps, giving a picture of the total time it takes to respond to a customer.

Next the class determines how much of the time between request and deployment is spent actually working on the problem. Typically, less than 20% of the total time is spent doing work on the request; for 80+% of the time the request is waiting in some queue. For starters, driving down this queue-time will let us deliver software much faster without compromising quality.
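
A value stream map boils down to a simple ratio, often called process cycle efficiency: work time divided by total elapsed time. The sketch below uses hypothetical steps and durations, but the shape, a few days of work buried in weeks of waiting, is typical of what the maps show.

```python
# Summarizing a (hypothetical) value stream map: how much of the elapsed
# time is actual work versus waiting in a queue between steps.

value_stream = [  # (step, work_days, wait_days_before_next_step)
    ("approve request", 0.5, 10),
    ("clarify requirements", 2.0, 20),
    ("design and code", 8.0, 10),
    ("verification", 3.0, 20),
    ("deploy", 0.5, 0),
]

work = sum(w for _, w, _ in value_stream)
wait = sum(q for _, _, q in value_stream)
total = work + wait

print(f"working time {work:.0f} days, waiting {wait:.0f} days, elapsed {total:.0f} days")
print(f"process cycle efficiency: {work / total:.0%}")  # here about 19%
```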

But reducing wait time is not the only opportunity for faster software development. Typically the value stream maps in my classes show a big delay just before deployment, at a step which is usually called ‘verification’. Now, I don’t have any problem with verification just before deployment, but when I ask, “Do you find any defects during verification?” the answer is always “Yes.” Therein lies the problem. When a computer hits the end of Dell’s assembly line, it is powered on and it is expected to work. The verification step is not the time to find defects; by the time software hits verification, it should work.

The way to get rid of the big delay at verification is to move testing closer to coding – much closer. In fact, testing should happen immediately upon coding; if possible the test should have been written before the code. New code should be integrated into the overall system several times a day, with a suite of automated unit tests run each time. Acceptance tests for a feature should pass as soon as the feature is complete, and regression testing should be run on the integrated code daily or perhaps weekly.
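
As a small illustration, here is what “the test written before the code” can look like with Python’s built-in unittest; the pricing rule and the apply_volume_discount function are invented for the example. A suite of such tests is cheap to run at every integration.

```python
# A sketch of a test written before the code it exercises, using the standard
# unittest module. The pricing rule and apply_volume_discount are invented.
import unittest

def apply_volume_discount(quantity, unit_price):
    """10% off orders of 100 units or more (written to make the tests pass)."""
    total = quantity * unit_price
    return total * 0.9 if quantity >= 100 else total

class VolumeDiscountTest(unittest.TestCase):
    def test_no_discount_below_threshold(self):
        self.assertAlmostEqual(apply_volume_discount(99, 10.0), 990.0)

    def test_discount_at_threshold(self):
        self.assertAlmostEqual(apply_volume_discount(100, 10.0), 900.0)

if __name__ == "__main__":
    unittest.main()  # cheap enough to run at every integration, several times a day
```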

Of course, this testing regime is not feasible with manual testing; automated unit and acceptance tests are required. While this may have been impractical a few years ago, the tools exist today to make automated testing practical. Obviously not all tests can be automated, and not all automated test suites are fast enough to run frequently. But there are many ways to make automated testing more effective; for example, each layer is usually tested separately – i.e., the business rules are tested below the GUI with most database calls mocked out.
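
For instance, a business rule can be tested below the GUI with the database call replaced by a stub. The OverdraftPolicy class and its repository below are hypothetical, and unittest.mock stands in for whatever isolation technique the team prefers:

```python
# A sketch of testing a business rule below the GUI with the database call
# replaced by a stub. OverdraftPolicy and its repository are hypothetical.
import unittest
from unittest import mock

class OverdraftPolicy:
    def __init__(self, account_repository):
        self.accounts = account_repository  # production code would hit a database

    def can_withdraw(self, account_id, amount):
        account = self.accounts.find(account_id)
        return account["balance"] + account["overdraft_limit"] >= amount

class OverdraftPolicyTest(unittest.TestCase):
    def test_withdrawal_against_overdraft_limit(self):
        repo = mock.Mock()
        repo.find.return_value = {"balance": 50, "overdraft_limit": 100}
        policy = OverdraftPolicy(repo)  # no database and no GUI involved
        self.assertTrue(policy.can_withdraw("acct-1", 120))
        self.assertFalse(policy.can_withdraw("acct-1", 200))

if __name__ == "__main__":
    unittest.main()
```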

In most of the value stream maps I see in my classes, there is a huge opportunity to move tests far forward in the process and catch defects at their source. Many companies spend a great deal of time tracking, prioritizing, and fixing a long queue of defects. Far better to never let a defect into the queue in the first place.

There is another area of my classes’ value stream maps that raises a flag. Toward the beginning of the map there is usually a step called ‘requirements’ which often interacts with a queue of change requests. Dealing with change requests takes a lot of time and approved changes create significant churn. There has been a feeling that if only we could get the requirements right, this ‘change churn’ would go away. But I generally find that the real problem is that the requirements were specified too early, when it was not really clear what was needed. The way to reduce requirements churn is to delay the detailed clarification of requirements, moving this step much closer to coding. This greatly reduces the change request queue, because you don’t need to change a decision that has not yet been made!

Toward the end of my classes, we draw a future value stream map, and invariably the new value stream maps show a dramatically shortened cycle time, the result of eliminating wait time, moving tests forward, and delaying detailed specification of requirements. We usually end up with a process in which cross-functional teams produce small, integrated, tested, deployment-ready packages of software at a regular cadence.

This kind of software development process exposes another compromise: conventional wisdom says that changes late in the development cycle are costly. If we are developing small bits of code without full knowledge of everything that the system will require, then we are going to have to be able to add new features late in the development process at about the same cost as incorporating them earlier.

The cost of adding or changing features depends on three things: the size of the change, the number of dependencies in the code, and whether or not the change is structural. Since we just agreed to keep development chunks small, let’s also agree to keep changes small. Then let’s agree that we are going to get the structural stuff right – including proper layering, modularization that fits the domain, appropriate scalability, etc.

We are left to conclude that the cost of non-structural change depends on the complexity of the code. There are several measurements of complexity, including the number of repetitions (the target is zero), the use of patterns (which reduce complexity), and McCabe scores (the number of decisions in a module). It has been shown that code with low complexity scores has the fewest defects, so a good measure of complexity is the number of defects.
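
As a rough illustration, a McCabe-style count is just one plus the number of decision points in a routine. The sketch below approximates it for Python functions using the standard ast module; it is a simplification of the real metric, intended only to show what “number of decisions in a module” means in practice.

```python
# A rough, illustrative McCabe-style count: one plus the number of decision
# points in each function, approximated with Python's standard ast module.
import ast

DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.BoolOp, ast.ExceptHandler)

def cyclomatic_estimate(source):
    results = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            decisions = sum(isinstance(n, DECISION_NODES) for n in ast.walk(node))
            results[node.name] = 1 + decisions
    return results

sample = '''
def classify(order):
    if order.total > 1000:
        return "large"
    for line in order.lines:
        if line.is_backordered:
            return "delayed"
    return "normal"
'''
print(cyclomatic_estimate(sample))  # {'classify': 4} -- one plus three decision points
```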

Which brings us back to our testing regime. The most important thing we can do to break the compromises we impose on customers is to move testing forward and put it in-line with (or prior to) coding. Build suites of automated unit and acceptance tests, integrate code frequently, run the tests as often as possible. In other words, find and fix the defects before they even count as defects.

Companies that respond to customers a lot faster than their industry average can expect to grow three times faster and enjoy twice the profits of their competitors.[3] So there is a lot of competitive advantage available for the software development organization that can break the speed–quality compromise, and compete on the basis of time.

References

[1] Michael L. George and Stephen A. Wilson, Conquering Complexity in Your Business, McGraw-Hill, 2004, p. 48.

[2] George Stalk and Rob Lachenauer, Hardball: Are You Playing to Play or Playing to Win?, Harvard Business School Press, 2004.

[3] George Stalk, Competing Against Time: How Time-Based Competition Is Reshaping Global Markets, Free Press, 2003 (originally published 1990), p. 4.
