Tuesday, 18 February 2014

The Scaling Dilemma

“One of the most scalable organizations in human history was the Roman army. Its defining unit: The squad – eight guys. The number of guys that could fit in a tent,” says Chris Fry, who knows a bit about scaling. He led software development at Salesforce.com during its years of hyper growth, and is now SVP of Engineering at Twitter. Fry found that the way to build a scalable organization is to focus on the basic building blocks – small, stable, multidisciplinary teams that are expected to independently tackle problems, make decisions, and get things done. Fry advises, “When it comes to building a deeply efficient engineering organization, there are several things you can do to move the needle:
  • Build strong teams first. Assign them problems later.
  • Keep teams together.
  • Go modular. Remove dependencies.
  • Establish a short, regular ship cycle.”[1]
Amazon.com works on the same principle. Its basic unit is the two-pizza team – a team small enough to be fed with two pizzas. When Amazon needed to scale, two-pizza teams were chartered to build Amazon Web Services, one service per team. A team includes everyone needed to design, deliver, and support the service – from specifications to operations. If a service is too large for a two-pizza team, Amazon prefers to split the service into smaller pieces rather than combine teams to deal with the larger service, because this preserves the dynamic interactions of small teams.

What Could Go Wrong?

If it works for Salesforce, Amazon and Twitter, surely it will work for you… So you form a lot of small, independent, multidisciplinary teams and you are careful to give them all clear goals. What could go wrong with that?

August 27, Scene 1

“Hi Owen, how’s it going?”

“Just great! We’re going to make our target. We got the last piece done overnight. We’re working on integration testing right now. We’ll have it ready to release tomorrow. Lucky for me, because this month I really need that bonus.”

August 27, Scene 2

“Hey George, I hear you had a shutdown last night.”

“Yeah, we got things up fast, but it still counts against our shutdown limit. It’s the last one we can afford this month, if we want our bonuses. So there won’t be any more.”

“Do you know what caused it?”

“The usual. A new release from development. A naive piece of code, the kind of thing you can’t test for. Anyway, it’s fixed.”

“And you’ve made sure they won’t make that mistake again?”

“Nah, they don’t want to listen to us. We’re just not going to put up any more releases until September. We’re not going to miss our target.”

“But I hear they have another release almost ready to go.”

“Over my dead body.”

The Dilemma

It is a beautiful thing when the building block squads of an organization gel into high performance teams and can be counted on to meet challenging goals. But the tricky part is, once you create these strong independent teams, how do you get them to work together? How do teams maintain their autonomous character while working in concert with an increasingly large network of other teams? How do you make sure that each team has a clear goal, but none of the goals are in conflict?

In a lean environment, the leader’s role is to set up strong teams, to be sure, but it is also to devise a system – let’s call it a goal system – which assigns goals to teams. This is no easy task. Teams need clear, meaningful goals; they have to be the right goals; teams must have the capacity, capability and autonomy to achieve their goals; and most important, the goals of various teams cannot conflict with each other.

One thing we know for certain is that local goals create local optimization. So it’s clear that the start of a goal system is a system-level, unifying goal. Something that conveys the purpose of the work, the why. Some way to confirm that progress is being made – at the team level – toward achieving the overall purpose. Of course, this is a lot easier said than done.

The Goal System

Many companies use projects to set up a goal system. A project manager lets the team know what the project goal is and what everyone needs to do to reach the goal. If there are several competing projects, a Project Management Office (PMO) is added to manage the project portfolio and distribute corporate goals among projects. However, there are problems with the project approach. You don’t often see stable teams in a project company, because people are usually assigned at project start and reassigned at the end. Worse, people are often assigned to multiple projects with competing demands on team members’ time. Finally, since most projects are conceived of as a relatively large batch of work, project teams tend to be quite a bit larger than a squad of eight to fourteen people. So the basic building blocks of scale – small, intact, multidisciplinary teams – are rarely found in a project environment.

One of the things Scrum has contributed to the practice of software development is the idea that small autonomous teams perform much better than large project teams or single-discipline teams that work in sequence. So Scrum provides the building blocks of scale, but unfortunately, it does not contain a scalable system for choosing team goals, making sure they contribute to organizational goals and are in sync with the goals of other teams. So we need to look elsewhere for ways to set up a goal system.

The Theory of Constraints

People with a lean mindset might look to the Theory of Constraints (TOC) for guidance on choosing and communicating team goals because it has a good track record for directing the efforts of multiple teams toward a single goal, at least in manufacturing.[2] TOC starts with the assumption that in any system there will always be a constraint that gets in the way of achieving the system goal, and the way to keep teams working toward the overall system goal is to be sure that everyone is focused on getting more work through the system constraint. Let’s see how TOC might be applied to developing a software system.

When the Constraint is Technical

The first step – after clarifying the overall system purpose and goal – is to find the biggest constraint to achieving that goal. For the purposes of discussion, let’s choose one of the most typical technical constraints encountered in delivering a software system: the integration of various components of the system without the introduction of defects or unintended consequences. In fact, project organizations typically allocate a third or more of the project time to release overhead – including integration, testing, fixing defects, and deployment – with the largest portion going to finding and fixing problems discovered during integration. When integration is the system constraint, TOC tells us that the most important focus for development teams should be removing this constraint.

Agile approaches to software development recommend the frequent delivery of working software to customers. When this recommendation is followed literally (software is released to end customers frequently), the integration constraint is regularly exposed and has to be confronted. One of the earliest agile approaches, Extreme Programming (XP), includes technical practices such as Test Driven Development and Continuous Integration that help make frequent releases practical. Continuous Delivery[3], which expands on these practices, has gained widespread favor as the agile approach which explicitly focuses on the integration constraint. In Continuous Delivery we find actionable advice on how to tackle the integration problem with techniques that scale across large networks of teams.

The objective of Continuous Delivery is to dramatically increase the number of times integration occurs while decreasing the amount of time it takes to negligible levels. This has the same effect that just-in-time flow does in manufacturing: the impact of defects is reduced to near zero because they are discovered immediately, before they can propagate or hide. The problem is, a much wider swath of an organization needs to get involved in Continuous Delivery than is typically found on a development team. The system architecture has to be devisable, the marketing department has to figure out how to deal with frequent deliveries, and the development and operations departments have to work closely together.

The Theory of Constraints can help here. If the constraint is integration, TOC recommends that we measure the rate at which work moves through the constraint – in this case the rate at which completely integrated and tested software is released to production – and make improvement of this rate the goal of every team involved in the system. It turns out that a throughput measurement on the system constraint is a great metric for team goals because it is easy to measure, provides immediate feedback, and is structured to result in improved system-wide performance. If everyone working on a system is trying to stabilize and improve the rate at which tested, integrated software is successfully released to end customers, then teams across the system will naturally have to work together and will find that their goals are compatible.
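As a rough illustration of how such a throughput measurement might be computed, here is a minimal sketch in Python. It assumes the only input is a list of timestamps of successful production releases; the function name, the four-week trailing window, and the sample dates are all hypothetical, chosen for illustration rather than taken from any particular tool.

```python
from datetime import datetime, timedelta


def release_throughput(release_times, window_end, window_days=28):
    """Successful production releases per week over a trailing window.

    The window length is an assumption; pick whatever period gives
    teams feedback that is fast but not too noisy.
    """
    window_start = window_end - timedelta(days=window_days)
    count = sum(1 for t in release_times if window_start < t <= window_end)
    return count / (window_days / 7)


# Hypothetical timestamps of releases that reached production successfully.
releases = [
    datetime(2014, 1, 24),
    datetime(2014, 2, 3),
    datetime(2014, 2, 10),
    datetime(2014, 2, 14),
    datetime(2014, 2, 17),
]

rate = release_throughput(releases, window_end=datetime(2014, 2, 18))
print(f"{rate:.2f} releases per week")
```

Only releases that actually made it to production are counted, so a release that is built but blocked before deployment does nothing for the number. That is what keeps development and operations pulling on the same metric.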

When the Constraint is Knowledge

More often than not, however, technical issues are not the biggest constraint in system development, and the fundamental problem is not an integration problem; it is a design problem or a fitness-for-use problem. Far too often we end up with a system that doesn’t work well or is difficult to use. In this case, the biggest constraint in developing software systems is the way in which we decide what to build and parcel that decision out among the teams doing the work.

Project organizations spend a good deal of time deciding what to do and turning these decisions into goals or requirements. However, this activity is front-loaded into the beginning of the project, while verification that the project requirements are correct waits until the project is complete – too late to make changes. Most project organizations consider deciding what to build to be an execution problem rather than a constraint. They would say: It’s unfortunate that project goals and requirements are sometimes wrong, but the way to fix this is to work harder on getting them right before the next project.

Organizations with a lean mindset would frame the problem differently. They are likely to identify the biggest barrier to making good decisions as incomplete knowledge, and thus the biggest constraint of the system would be the rate at which knowledge is generated. So a lean organization would focus on the feedback loop between customers and development teams. They would decrease the length of the feedback loop, increase the speed of the feedback loop, and remove barriers to the free flow of information inside the feedback loop. These organizations would say: If we can test our assumptions and designs more quickly we will learn faster and make better decisions.[4]

No More Politics

One of the signs that teams have conflicting goals is negative politics. A good way to eliminate political wrangling is to get teams to work together toward a unifying goal and to show team members the impact of their actions on that goal – in real time.

August 27, Alternate Scene

“Hi George, I noticed a little glitch in last night’s graph so I thought I’d check to see if anything went wrong.”

“Oh hi, Owen. Glad you stopped by. As a matter of fact, we did have a shutdown last night, but we were lucky and caught it right away and got things back up fast enough we didn’t lose any customers.”

“So what caused it, do you think?”

“It was a database lockout. It seems to happen a lot about six to eight hours after a new release. When we find it we do a workaround that lasts until the next release.”

“Whoa! That means it’s something in the code that isn’t getting caught in testing.”

“Well, yeah! There’re a lot of problems in code that only come out during production.”

“Um, well, we worked late last night to get that feature ready that Chris asked for. Do you think we can do another release tomorrow?”

“Well Owen, let’s take a look at the graph. You see last night there was a downward spike here, where we closed down the site and no one could use the app. But we got it up so fast that almost everyone was still around and we could just reconnect them. It was the middle of the night here, so we mostly had browsers in Europe and Asia. We don’t have many customers there – yet. But what if we hadn’t caught it so fast? In five minutes we would have lost maybe half of the people using the app. What if it had happened during the daytime here? That spike would have been an order of magnitude bigger, and we would have lost a lot more people. It’s one of our busiest seasons, right before school starts. Do you really think we should take such a risk?”

“But George, we have to do it sometime. Sooner is better than later.”

“Not the way I see it, Owen. The fewer releases we put out, the fewer customers we’re likely to disrupt.”

“Okay, I see your point, but we’ve got to fix that problem so we can release frequently, because new features are what drives that graph up higher.”

“Not if the system crashes, they aren’t.”

“Whatever. We still have to fix the problem.”

“Yeah, so how do you propose we do that?”

“How about a side-by-side release with a trigger that knocks out the new system if it gets flaky?”

“Easy for you to say, Owen. You don’t have to make it happen.”

“But if I could make it happen, would you let me?”

“Sure, why not? But it won’t be easy.”

“How about I get with the team and tell them that in order to have a release, we have to write some failure detection and recovery code, and babysit the next release around the clock to be sure it works.”

“Can’t hurt to give it a try.”
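What might Owen’s trigger look like? The following is only a sketch, assuming the new and old releases run side by side and that some monitor reports an error rate for the new release at regular intervals. The class name, the 5% error threshold, and the rule of three consecutive bad checks are all hypothetical; a real system would choose its own health signals and limits.

```python
class SideBySideRouter:
    """Routes traffic to the new release until it looks flaky, then falls back."""

    def __init__(self, threshold=0.05, max_bad_checks=3):
        self.threshold = threshold            # error rate considered unhealthy
        self.max_bad_checks = max_bad_checks  # consecutive bad checks tolerated
        self.bad_checks = 0
        self.active = "new"

    def record_health_check(self, error_rate):
        """Record one health check and return the release that should serve traffic."""
        if self.active == "old":
            return self.active  # already fell back; stay on the old release
        if error_rate > self.threshold:
            self.bad_checks += 1
        else:
            self.bad_checks = 0  # a healthy check resets the count
        if self.bad_checks >= self.max_bad_checks:
            self.active = "old"  # the trigger: knock out the new release
        return self.active


# Hypothetical error rates observed after a release.
router = SideBySideRouter()
for observed in [0.01, 0.08, 0.02, 0.09, 0.12, 0.20, 0.01]:
    router.record_health_check(observed)
print(router.active)
```

Requiring several consecutive bad checks is a design choice: it keeps a single noisy reading from knocking out a healthy release, at the cost of reacting a little more slowly to a real failure.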

The Unifying Goal

A network of strong teams is the first step to scale. The second step is to set up a system that distributes goals to teams in a way that avoids goal conflict. One good way to do this is to find a goal that is the final arbiter of ‘good,’ make it visible to all teams in real time, and hold teams responsible for it.

The purpose of throughput accounting in the Theory of Constraints is to create just such a unifying goal. Simply stated, throughput accounting provides a measurement of the rate at which an organization achieves its purpose. This rate is effectively the same as the rate of throughput at the system constraint, so teams working to improve throughput at the constraint or to improve overall system impact are working toward the same goal. Either way, a single unifying goal allows individual teams to act autonomously, confident that they are not working at cross purposes with other teams.

What if you cannot find a unifying goal that represents the system constraint, or if a team’s work has no apparent impact on that goal? Over time it would be better to move to a decoupled architecture so that individual teams can have an impact on the overall system goal. But in the meantime, each team should monitor the impact of its work on its immediate customers. The question is not: Did the team complete its work? The proper question is: Did the team give its immediate customers what was needed, when it was needed, in a state that allowed the downstream teams to perform their work well?

Finally, remember that scaling is a two way street. If you think that scaling gives you problems as a leader, imagine the problems it brings to other people in the organization. It’s fun to work at a small company where everyone knows what’s important and works together to make the company successful. But as the company grows, people at all levels are in danger of losing their line of sight from what they are doing to the company’s success. This limits their ability to act with initiative to bring about that success, and thus undermines their engagement. You can’t scale unless you keep all of the bright minds in the company engaged and working together to help the company grow.
______________________________

Footnotes

[1] From First Round Review: Unlocking the Power of Stable Teams with Twitter's SVP of Engineering.

[2] I was reminded of this as I read Tame the Flow by Steve Tendon and Wolfram Müller, which provides a nice summary of how the Theory of Constraints in general, and throughput accounting in particular, can generate much better team goals than either the work-bin WIP limits of Kanban or the cost/profit focus of traditional cost accounting.

[3] See Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation by Jez Humble and Dave Farley (2010), and Lean Enterprise: Adopting Continuous Delivery, DevOps, and Lean Startup at Scale by Jez Humble, Barry O’Reilly, and Joanne Molesky (2014).

[4] It’s interesting to note that Continuous Delivery is an excellent way to address the knowledge constraint as well as the technical constraint of software development.