
Measuring continuous improvement

Let’s walk through a couple of process improvement scenarios to explore the differences between measuring activities and measuring outcomes. The starting point is a software development team that has decided to improve software quality.

In the first scenario, the team chooses a solution before doing any analysis. Perhaps one or more team members had used the technique on previous jobs, and found it helped them. Perhaps the team has read or heard that the technique is a good practice. In any case, the choice is not based on an analysis of the current situation. They measure progress by tracking the team’s adoption of the selected technique. This is an example of measuring activities.

In the second scenario, the team has the same goal — to improve software quality — but they go about it in a very different way. Instead of measuring activities, they measure software quality as directly as they can. Then they explore different approaches to improvement and use metrics to assess the effects of each. This is an example of measuring outcomes.

I suggest that when we measure activities, we risk limiting our thinking in ways that can be detrimental. When we measure outcomes, we keep ourselves open to different approaches and different solutions. There is a pearl of wisdom floating around that says whatever we don’t measure, we can’t see. If we can’t see a problem, we can’t solve it.

When we are focusing on an activity for its own sake, we don’t see the outcome of that activity, and we don’t see alternatives to the activity that might yield better results — or if not better, then at least feasible in light of present circumstances. As long as we’re carrying out the activity “by the book,” we think everything is fine. When we eventually discover that everything isn’t fine after all, we are surprised…and it isn’t a happy surprise like a birthday present.

Note: Although the scenarios are written in the first person, they are not autobiographical; they are allegorical. Also, this is not a tale that relates events literally. It is a composite of things that happened on different occasions.

Scenario 1: Track the team’s adoption of a pre-selected solution

Our team decided to try and improve the quality of the software we produced. We talked about various ways to accomplish the goal. Quite a few options were available. We could improve our testing practices. We could pull testing activities forward in the process. We could conduct peer code reviews. We could look for ways to make our requirements definitions a little more concrete. We could start to use a published approach for delivering high-quality software, such as the Cleanroom method, Evolutionary Project Management, or Extreme Programming.

But those things sounded like a lot of work. What we really wanted was a way to get better results without any impact on our tactical delivery rate. One of our colleagues told us she worked at a company where they used test-driven development (TDD) to write their code. She said it worked for them, and her description of it sounded pretty simple. Actually, it sounded so simple that we could figure it out on our own, using web-based tutorials and common sense. No need for training courses or coaches or anything like that.

Okay, so how would we measure progress? Well, since we’d decided our approach would be to adopt the practice known as TDD, it seemed logical to measure progress by tracking our adoption of TDD. TDD consists of a three-step cycle, red-green-refactor, so all we had to do was look at what the programmers were doing to see if they were following that process. Red means write a failing test; that much sounds easy enough! Then green means make the test pass. Okay. Finally, refactor just means clean up the code. We’re already highly accomplished programmers with many years of experience, so that doesn’t sound like anything more complicated than common sense. No problem, then. Let’s go!
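For anyone who has never seen the cycle written down, here is a minimal sketch of one pass through red-green-refactor. It is a made-up Python example, not our actual code; the names and the discount rule are invented for illustration.

    import unittest

    # Made-up example of one pass through red-green-refactor -- not our actual code.

    # RED: write a failing test first. Before discounted_total() exists (or while it
    # still returns the wrong answer), these tests fail.
    class DiscountTest(unittest.TestCase):
        def test_ten_percent_discount_over_100(self):
            self.assertEqual(discounted_total(200.0), 180.0)

        def test_no_discount_at_or_below_100(self):
            self.assertEqual(discounted_total(100.0), 100.0)

    # GREEN: write just enough production code to make the tests pass.
    # REFACTOR: with the tests still green, clean up names and remove duplication.
    def discounted_total(amount):
        """Apply a 10% discount to orders over 100."""
        return amount * 0.9 if amount > 100 else amount

    if __name__ == "__main__":
        unittest.main()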

Here’s what happened:

[Diagram (PDF): the goal (improve software quality), the chosen practice (team tries to use TDD), the adoption question (are they following the red-green-refactor cycle?), and the resulting Un-Desirable Effects]

So, at the top of the diagram we have our goal: Improve software quality. To do that, we decided to adopt the practice, TDD. To measure our progress, we checked to see if the programmers were following the red-green-refactor cycle when they wrote code.

That looked like it ought to work, but something funny happened. As we went forward, we experienced more and more problems with code quality. Instead of improving quality, TDD seemed to be destroying quality. According to our metrics, everything was proceeding well. Every time we asked the programmers if they were following the red-green-refactor cycle, they said yes. Surely they weren’t lying!

Logically, therefore, the problem had to be that TDD just didn’t work. The colleague who had recommended TDD to us was at a loss to explain what was happening. As far as she could remember, they hadn’t had problems like this on her previous job.

Eventually we decided we’d better discontinue using TDD. We were better off with the problems we were experiencing before.

An important part of any improvement initiative is to take the lessons learned forward to support double-loop learning in the organization. The key lesson learned in our experience was that TDD just doesn’t work. From that point forward, no one in our company would waste any time with TDD again. It had been an expensive lesson, but well worth it in the long run. We probably saved our company millions!

Some weeks later, I met an old friend for lunch and told him our story. He had some experience with TDD, and he had a slightly different interpretation of our experiences and our diagram than we had.

He had only barely glanced at the diagram before he said, “You’re measuring activities instead of outcomes.” I asked him what he meant, and he explained, “TDD is not an outcome. It’s not a goal. It’s a means to an end. It’s a technique. An activity.”

“But it’s an activity that’s supposed to improve code quality,” I protested.

“It can help with code quality, among other things,” he said with a shrug, “but here you aren’t looking at the problem you’re trying to solve at all. You’re just looking at methods or techniques. You might succeed in using a method or technique and still miss the mark.”

“No, that can’t be right,” I said. “If TDD produces good code, then if we’re using TDD we have to produce good code. Therefore, if we use TDD and we get bad code, TDD doesn’t work.”

“It’s not quite that simple,” my friend said. “TDD is just a tool, like a hammer. You can use a hammer to build a beautiful set of kitchen cabinets, or you can use it to smash all the bones in your thumb. The hammer itself doesn’t know the difference. It’s up to you to learn how to use the hammer properly.”

“Okay,” I conceded reluctantly, “but according to our metrics, we were using TDD properly. It even says so on the diagram: Are they following the red-green-refactor cycle? See? And in spite of that, we still had problems.”

“Yes, I see,” he said. “Look, there are some preconditions that have to be in place before you can really get a lot of value out of TDD.”

“Preconditions?” I asked, puzzled. “What preconditions? The online tutorial we used didn’t say anything about preconditions.”

“Well, it’s sort of like learning to use the hammer so that you don’t smash your thumb,” he said, fishing a pen from his computer bag and turning the diagram around so he could draw on it. “The specific problems you show on this diagram suggest certain probable root causes,” he said professorially.

After mocking his professorial manner with a clever bit of pantomime, I watched as he started to add boxes in between Team tries to use TDD and each of the Un-Desirable Effect boxes. The diagram ended up looking like this:

[Diagram (PDF): the same diagram with precondition boxes added between Team tries to use TDD and each Un-Desirable Effect]

“Let’s start with this one: New code is test-driven but existing code is not remediated.”

“Okay,” I said triumphantly, “that’s a great example of where TDD falls short. It’s literally impossible to test-drive code that already exists!”

“Strictly speaking, that’s right,” he admitted, “because you can’t start with red. But that’s just what I was getting at when I said we should measure outcomes and not activities. You’re so focused on the activity — on the mechanism of the TDD cycle — that you overlook the results you’re getting. To avoid this particular problem, the team has to understand how to deal with legacy code. They have to clean up the legacy code bit by bit as they go along, in order to enable TDD to work.”

“Easier said than done!”

“True, but if the work were easy they wouldn’t need us. They could just toss the requirements into a room full of monkeys.”

“Good point. Okay, I get that,” I said, “but what about this one? Poor test coverage and/or missing test cases. You’re showing five different root causes for that one problem! Can that be right?”

“Those are just five I could think of off the top of my head right now. There are probably more.”

“Okay, I guess,” I said, scanning the “preconditions” my friend had added to the diagram. “What about this one, then? Build times steadily increase. You’ve got four root causes for that one, but doesn’t it all boil down to an inherent limitation in the usefulness of TDD? The more tests we wrote, the longer the build took to run.”

“Did you look at the preconditions?”

“Yeah, but…we’re already advanced programmers, you know. This thing about programmers understanding software engineering principles…that’s kind of a given, isn’t it?”

My friend shrugged. “Is it? You tell me.” I noticed he was tapping his fingers on the Build times steadily increase and Technical debt steadily increases boxes that were fed by the box labeled Programmers must understand & apply rigorous software engineering principles on my diagram.

I began to suspect some of the more junior folks on our team might not be completely up to speed in that area. Not me, of course. I’m advanced. I’m not so sure about those other guys, though. “Well, maybe it isn’t a given. And what’s this? Isolate code under test? Isolate it from what? Earthquakes? Violent video games? West Nile Virus?”

He laughed. “Isolate it from other code.”

“What does that mean?”

He smiled and looked at me for a moment as if wondering whether I was joking. Then he said, “The idea is that any given test case should be able to fail for only one reason: That the code under test doesn’t behave as expected. When you have dependencies on other parts of the code base or on external resources, your test case doesn’t have that property. It can fail for reasons unrelated to the intent of the test.”

“But how can we write a test that will run at all without accessing the external resources that the code under test uses?”

“There are ways, but none that I can show you in the time remaining for lunch today. Suffice it to say that one likely cause of your build times increasing is that you’re adding tests that have external dependencies, so the tests are taking too long to run.”

“Well, I’m not really convinced. I guess I’d have to see that.”

“Fair enough.”
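(For readers who, like me at the time, would have to see it: here is a minimal made-up sketch of the kind of isolation he was describing. The function and its repository are hypothetical; the point is that the test substitutes a stand-in for the external resource.)

    import unittest
    from unittest.mock import Mock

    # Hypothetical code under test: it asks a repository object for data instead of
    # reaching out to a real database or web service on its own.
    def overdue_invoice_total(repository):
        """Sum the amounts of the invoices the repository reports as overdue."""
        return sum(invoice["amount"] for invoice in repository.find_overdue())

    class OverdueInvoiceTotalTest(unittest.TestCase):
        def test_sums_only_overdue_invoices(self):
            # The test supplies a stand-in repository, so it can fail for only
            # one reason: the calculation itself is wrong.
            repository = Mock()
            repository.find_overdue.return_value = [{"amount": 100.0}, {"amount": 250.0}]
            self.assertEqual(overdue_invoice_total(repository), 350.0)

    if __name__ == "__main__":
        unittest.main()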

“And just look at all these other things you’ve added to the diagram! If I take this literally, then it means team members in every role have to have all kinds of new skills, and learn each others’ jobs, before we can even try TDD. Is that even possible?”

“Well, it isn’t as bad as all that,” he said. “You don’t need to fulfill every one of these preconditions just to get started with TDD. You’d want to, if you expected to get full value from it. But the point here is really something a bit different. Not TDD as such, or any other practice in particular.”

“What, then?”

“The point is that when you’re trying to measure progress in a process improvement initiative, you need to measure the factors you’re trying to improve, rather than measuring the adoption of some specific practice or technique. Let the metrics guide you to the right practices and techniques. Don’t leap to the conclusion that some particular practice is going to solve all your problems magically.”

“Not even TDD?” I quipped. “I thought you were a proponent.”

“I’m a proponent of whatever works in context,” he said.

“Hmm. Well, you seem to have a lot to say about TDD, for a neutral party.”

“I’m not neutral on the question of smashing my thumb with a hammer,” he said. “When I use a development technique, I want to use it right.”

“That’s starting to sound like the tiresome old ‘you did it wrong’ thing that evangelists always say. On the other hand, I can’t deny that we smashed our thumb pretty good.”

My friend shrugged again. He does that a lot. “So, how do you measure quality?” I asked. “It seems to be a slippery concept. Everyone has their own definition.”

“Maybe so, but you’re not everyone; you’re just one team in one company. In your context, quality means whatever it needs to mean to you and your stakeholders. So the problem of definitions is not as open-ended as you might assume it is. The range of useful definitions will be limited by the context.”

Scenario 2: Track the changes in the factor we want to improve

In a parallel universe, the same team was in the same situation and had the same goal of improving software quality. They approached the problem differently.

 . . . . . . . . . . . . .

Our team decided to try and improve the quality of the software we produced. We talked about various ways to accomplish the goal. Quite a few options were available. We could improve our testing practices. We could pull testing activities forward in the process. We could conduct peer code reviews. We could look for ways to make our requirements definitions a little more concrete. We could start to use a published approach for delivering high-quality software, such as the Cleanroom method, Evolutionary Project Management, or Extreme Programming.

But it wasn’t obvious to us which of these options, if any, would get us to the level of code quality we wanted to have. We decided to take an empirical approach. We would tweak our development process in a series of time-bound experiments and measure the effect on code quality, gradually homing in on practices that helped us achieve the level of quality we wanted.

Here’s what happened:

[Diagram (PDF): the goal (improve code quality) at the top, followed by definitions, metrics, baseline measurements, performance goals, performance gap analysis, and the iterative improvement experiments]

You can see our goal, Improve code quality, at the top of the diagram. We thought it would be sensible to define what “code quality” means. Otherwise, how could we possibly know what we were aiming for?

Definitions

“Quality” is a pretty generic term. It can imply many different things. We decided that we should narrow it down to quality-related factors that affected three key stakeholder groups: Our customers, our production support group, and ourselves. We figured that as we made progress we could expand the definition to cover issues of interest to additional stakeholder groups. We just didn’t want to bite off more than we could chew at the start. The result was a three-fold definition of “quality:”

  • High customer satisfaction
  • Low defect density
  • High maintainability

Okay, now we had working definitions of “quality.” The next step was to come up with ways to measure these three aspects of quality.

Metrics

An interesting thing about the quality factors we selected is that they had different characteristics that affected our ability to measure them quantitatively. Defect density is completely quantitative and easy to measure. Customer satisfaction is subjective and qualitative, so it was a challenge to come up with ways to track improvement in that area. Code maintainability seems pretty subjective at first glance, but as it turned out we were able to find a quantitative measure that gave us a pretty reliable indication of how maintainability was trending.

Defect density is a well-known industry measurement that is simply the number of defects per thousand lines of source code. Generally, a value of about 0.036 is considered typical for business applications written in languages like Java, C#, or C++.

Normally, I’m not a big fan of counting lines of code. Most metrics based on lines of code tend to encourage people, even if unintentionally, to write more code rather than less. But defect density tends to have the opposite motivational side-effect. The simplest way to avoid defects is to reduce the total quantity of source code. When defect density is measured, people tend to do just that.

To track code maintainability, we decided to use the average cycle time in each of three broad categories of work items based on their scope: large, medium, and small. Cycle time is the elapsed time to deliver a single work item.

Customer satisfaction proved to be the most challenging item to measure. We decided to track it in three ways. First, we would count the number of customer contacts with our help desk, as well as informal contacts with team members. Second, we would track the proportion of positive to negative customer experiences, as reported subjectively by callers to the help desk. Third, we would conduct periodic surveys of our customers to solicit their feedback.

Baseline measurements

Using historical information provided by our production support group, we learned that we were delivering code with a defect density of 0.065. Our application comprises about 2,000,000 source lines, so that corresponds to an average of 130 open issues at any given time. By this measure, our baseline performance was below average for the industry.
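Spelled out, the arithmetic behind that baseline is simple:

    # Defect density arithmetic, using the figures above.
    source_lines = 2_000_000      # total source lines in the application
    defect_density = 0.065        # open defects per thousand lines (KLOC)

    kloc = source_lines / 1000    # 2,000 KLOC
    open_defects = defect_density * kloc

    print(open_defects)           # 130.0 open issues at any given time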

We had not been tracking cycle time previously. We had been tracking velocity in terms of story points, and running two-week time-boxed iterations. We found it was possible to use historical data to infer cycle times. However, when we started to do that we discovered our velocity observations were based on a fudged definition of “done” in order to get “credit” for completing work items within the bounds of the two-week iterations. When we counted the time required to deliver stories of large size all the way to the pre-production staging environment, we found our cycle time for large work items was actually 20 days. So, just the act of putting numbers on our performance had already exposed a problem area in our process.

We found our baseline performance for cycle time was 3 days for small items, 8 days for medium-sized items, and 20 days for large items.
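Deriving those averages from work-item history is straightforward. Here is a minimal sketch with made-up records, chosen so they reproduce the baseline figures above; our real tracker data and fields were messier.

    from collections import defaultdict
    from datetime import date

    # Made-up work-item history: (size category, start date, delivery date), where
    # "delivery" means reaching the pre-production staging environment.
    history = [
        ("small",  date(2013, 3, 4), date(2013, 3, 7)),
        ("small",  date(2013, 3, 5), date(2013, 3, 8)),
        ("medium", date(2013, 3, 4), date(2013, 3, 12)),
        ("large",  date(2013, 3, 4), date(2013, 3, 24)),
    ]

    durations = defaultdict(list)
    for size, started, delivered in history:
        durations[size].append((delivered - started).days)

    for size, days in durations.items():
        print(size, sum(days) / len(days), "days average cycle time")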

There are a couple of general approaches to changing a process. One is to impose a radical change suddenly and comprehensively. This is sometimes called shock therapy. In some situations it can be effective, but it is also extremely disruptive. The contrasting approach is to start where you are and introduce changes incrementally. This is a practical approach in many circumstances, but sometimes “where you are” is so bad that there’s no way to “start” without making a few changes immediately.

While our overall approach was incremental, in order to obtain baseline measurements of customer satisfaction we had to take special action. We wanted to conduct customer satisfaction surveys every two months throughout the improvement initiative. To get a baseline, we had to conduct an initial survey at the outset of the initiative. We scheduled a meeting with key stakeholders to inform them of what we were up to, and they agreed to respond to the survey. The survey results boiled down to a satisfaction rating on a scale of 1 to 5. The initial results averaged 2.

From the help desk we were able to determine that about 40 issues pertaining to our application came to them monthly. They didn’t have a formal mechanism to ask customers about their experiences, but they had records that indicated about 1 in 10 callers reported a positive experience. We didn’t have data about the other 9 in 10 callers. We decided to use that as our baseline, and in order to track progress we asked the help desk to add a question about customer experience to their standard procedure.

Structure of the improvement initiative

We reasoned that without a time limit we might hack away at half-hearted improvement ideas forever, without necessarily getting to any sort of resolution. Therefore, we set a limit of 8 months on the entire initiative. If we didn’t reach our quality goals by then, or if we hadn’t learned enough along the way to justify changing our plan, then we would end the initiative at that point. We defined checkpoints at two-month intervals to re-assess goals and progress so that we would have a mechanism to terminate early or change direction, as appropriate.

To experiment with single process changes iteratively, we matched our improvement iterations to our delivery iterations — two weeks each.

Setting performance goals

We set goals for each of the measures we had decided to use to track improvement. We set out to achieve a customer satisfaction rating of 4 on a scale of 5, to reduce the overall number of help desk calls to 20 per month, and to increase the ratio of positive customer experiences to 7 in 10.

We set cycle time targets of 1 day (small items), 4 days (medium-sized items), and 8 days (large items). This would enable us to deliver production-ready solution increments in each time-boxed iteration without having to fudge the definition of “done” to give the appearance of completion. This would serve as our indirect measure of code maintainability. If the same team could complete work items of the same scope in less time than before, this would indicate the code was becoming easier to work with.

A couple of team members pointed out that if we improved our technical skills, cycle time would decrease. We talked it over and concluded it was likely our code quality would improve faster than our own skills improved, so changes in cycle time would probably indicate changes in code quality rather than changes in our own skills.

Some of us also thought that as our skills improved we would devote proportionally more time to testing, which would tend to keep cycle times stable. Not everyone on the team agreed with that idea.

There was another issue with using cycle time that worried us at first. By eliminating re-work and reducing time spent clarifying requirements, we would reduce cycle time. That would not say anything about improvements in code quality. It was an imperfect measure for the purpose. We would have to keep that in mind going forward.

In hindsight I would say we spent more time discussing cycle time than the other metrics because it is only an indirect indicator of code maintainability. We still thought it was better than using purely subjective judgments.

We set a goal to reduce defect density to 0.03, which would put us just ahead of average performance for the industry. That would equate to an average of 60 open issues at any given time, for a code base of 2,000,000 lines. Doesn’t sound like anything to brag about, but it was both realistic and achievable.

We didn’t achieve perfect consensus about every little detail, and we didn’t have perfect data in every area of measurement, but we felt we were ready to go forward.

Performance gap analysis

This proved to be the single most time-consuming preparatory task, although in hindsight we all feel it was well worth the effort. It comprised an end-to-end analysis of the entire delivery process, starting with activities in the business stakeholders’ area and flowing through all of the development work all the way to production support.

We actually mapped out the entire delivery stream, not just the portion of it for which our team is directly responsible. We knew that quality problems were systemic, and could not be solved just by improving some small aspect of programming or testing. We wanted to understand how our work fit in with the rest of the “system” of which our team is a part.

One of the things that fell out of this analysis was that a significant root cause of customer satisfaction problems had to do with the automated voice response menu callers had to navigate when they called the help desk. It wasn’t something we could fix by improving our programming or testing methods.

Help desk management expected 85% of callers to get the information they needed from the voice response system. They had staffed the call center accordingly. For whatever reason, though, the automatic responses answered only about 45% of callers’ questions. The rest of them queued up to speak with customer service representatives, who were overloaded. The primary cause of customer satisfaction issues turned out to be long wait times on “hold.” We communicated our findings to the appropriate people in the organization and proceeded with our quality improvement initiative.

In general, our gap analysis pointed to analysis, programming, and testing activities that seemed to be causing most of our quality-related problems. We embarked on an iterative improvement program in which we introduced a change, let the process stabilize with the change in place, took measurements, and then determined whether the change had resulted in any improvement in the metrics we were using to track progress. We’re still doing that.

Lunch in a parallel universe

Strangely enough, an old friend from a parallel universe called me a few weeks later to invite me to lunch. He brought with him a diagram of a process improvement initiative he and his team had undertaken at their company. They had approached the problem in a very different way than we had done.

The following week I invited him to lunch and brought with me the diagram of our improvement process. I thought the difference in our approaches might interest him. He had followed the steps about halfway down to the Performance gap analysis box when he remarked, “This is a lot more complicated than our approach. Didn’t it take a long time?”

“It’s still taking a long time,” I replied. “We’re still doing it. We’re two months into an eight-month plan, at this point.”

“Knowing you, I suppose the first thing you did was to implement TDD.”

“We considered it,” I said. Pointing to the diagram, I said, “See those steps where we select an improvement experiment?” He nodded. “Those are the points where we choose to try some specific practice, introduce a new tool, or make a change to our process.”

“And you haven’t adopted TDD?”

I shook my head, no. “Remember those preconditions I mentioned last week?”

“How could I forget?” he said wryly.

“Well, we determined that we weren’t in a position to succeed with TDD.”

“So you gave up on it, just like that?” he asked, with one eyebrow elevated in surprise.

“No, we decided to take pragmatic steps toward our goal of improving code quality,” I said. “At a strategic level, we’d like to reach the point that we can use a powerful technique like TDD. At a tactical level, we wanted to start making gains quickly.”

“Well, we just went straight to the good stuff. Straight to TDD from Day One.”

“Mm-hmm. Word on the street is that you guys canceled your improvement initiative.”

“Well, technically, yeah. But, hey: How did you come up with those preconditions, and we didn’t?”

I shrugged. “If I had to guess, I’d say it was because you pre-selected a specific technique and then measured your adoption of that technique. That limited your thinking. You were focused on the mechanics of the technique to such an extent that you never considered anything beyond it. In our case, we were examining any and all potential ways to improve code quality. That approach kept us open to alternatives, prerequisites, learning curve issues, and so forth.”

“That sounds like comic book psychology to me,” he said.

I shrugged. “Here’s another example, then: We identified problems that were outside our immediate scope of work, like the voice response menu thing. The reason you guys didn’t see problems like that is you were narrowly focused on just one small aspect of the work flow.”

“Help desk stuff isn’t our responsibility,” he said.

“Not your immediate responsibility, no, but it’s part of the same system as your team. Everything is inter-related. When you take a very narrow focus, you can’t see the rest of the system, and you overlook things that could help you.”

“How could that help us? We just do software.”

“Customers consider every part of their experience with software as a single thing. The help desk is part of that experience. If senior management keeps hearing bad news about ‘the software’ who do you think is going to get fired? The help desk people, or the ‘software’ people?”

“Well, I don’t know about that. It sounds like a stretch.” He continued, “Okay, then, if you didn’t want to make the big leap to TDD, what did you do?”

“Our gap analysis exposed opportunities for improvement in analysis, programming and testing. We chose improvement experiments in those areas.”

“Such as?”

“Such as bringing analysts and testers together very early in the process to collaborate directly on both requirements and test scripts.”

“Okay. What does that have to do with code quality?”

“We had learned that one of the root causes of escaped defects had been misunderstandings about the exact meaning of requirements. Analysts were setting expectations with stakeholders in a certain way, while testers were setting up test scripts with completely different expectations. At the same time, programmers felt as if they didn’t have all the details they needed, and they were under delivery pressure, so they guessed rather than slip the date waiting for answers. Sometimes they guessed wrong. Now, our test scripts are included with our requirements specifications, and everyone is singing from the same song sheet.”

“That sounds like a step in the direction of specification by example.”

“Yes, exactly. It’s a step. But…”

He held up a hand and said, “Don’t tell me: You don’t have all the necessary preconditions in place for success.”

“Right. Now you’re getting it.”

“Not really. I don’t get what’s stopping you from doing specification by example. Analysts, programmers, and testers are already on the same team, aren’t they? So, you’re good to go, right?”

“The stumbling block is that our test data is provided to us by a separate department. They extract data from production data stores, scrub it, and provide it to development teams. To do automated specification by example or ATDD, we have to have direct control of our own test data. Otherwise, our automated test cases aren’t repeatable. That’s an organizational barrier at the moment.”

“Okay, so if you aren’t doing TDD or ATDD, then what are you doing?”

“We’re holding peer code reviews on a regular basis.”

“In the past, you’ve told me that code reviews aren’t nearly as effective as TDD.”

“They’re more effective than what we were doing before, and they’re something our team was ready to start doing immediately. Remember, we want to make progress toward our goals. We don’t want to wait until all the ducks are lined up for some academically-perfect approach before we take any action at all.”

“Okay, okay, don’t get huffy.”

“Those were our first tactical efforts, and they did bear fruit.”

“And on the strategic level?”

“On the strategic level, we knew we had to start ramping up to those preconditions we’ve talked about. We’re supporting a mature app, and it’s pretty monolithic. We decided to address technical debt right away.”

“That sounds like a big bite. How did you address it?”

“In a couple of ways. First, we got some numbers to help us decide where to focus our efforts.”

“What kind of numbers?”

“Well, for one thing, we took historical data from our version control system to find out which source files and config files were most frequently updated. Then, we got static code analysis metrics from our build server to tell us which modules had the most dependencies on other modules, and which source code units had the highest complexity. Then we correlated the two lists and targeted the source files that appeared high on both lists.”
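A minimal sketch of that correlation step, for the curious. It assumes a Git repository and a CSV complexity export from the build server; both are illustrative assumptions rather than our actual setup.

    import csv
    import subprocess
    from collections import Counter

    # Sketch of the churn-vs-complexity correlation described above. It assumes a
    # Git repository and a static-analysis complexity report exported as a CSV with
    # "file" and "complexity" columns -- assumptions for illustration, not our setup.

    TOP_N = 20

    # 1. Churn: how often each file has been touched, from version control history.
    log = subprocess.run(
        ["git", "log", "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    ).stdout
    churn = Counter(line for line in log.splitlines() if line.strip())

    # 2. Complexity: per-file metric from the static-analysis export.
    with open("complexity_report.csv", newline="") as f:
        complexity = {row["file"]: float(row["complexity"]) for row in csv.DictReader(f)}

    # 3. Correlate: files near the top of both lists are remediation candidates.
    top_churn = {path for path, _ in churn.most_common(TOP_N)}
    top_complexity = {path for path, _ in
                      sorted(complexity.items(), key=lambda kv: kv[1], reverse=True)[:TOP_N]}

    for path in sorted(top_churn & top_complexity):
        print(path, churn[path], complexity[path])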

“Aren’t there source files that are crufty but aren’t updated frequently?”

“Isn’t there an old saying: If it ain’t broke, don’t fix it?”

“Okay. What about the source files most frequently touched for production bug fixes?”

“Those were automatically included in the list of most-frequently-updated files.”

“Oh, yeah, I guess they would be. Okay, so then what?”

“Then we carved out a slice of legacy remediation work in each iteration.”

“I thought that by the book we were supposed to do refactoring incrementally, in the course of working user stories.”

“We aren’t following a book. We’re driving toward our quality goals guided by quantitative metrics.”

“Fine, but if the team isn’t up to speed on techniques like refactoring, or if team members aren’t familiar with basic software engineering principles of one sort or another, then how were you able to get started with remediating your legacy code?”

“We got ahold of a book by Michael Feathers about working with legacy code, and…”

“Hang on, wait a sec, time out. You just said you weren’t following a book.”

“Right. We aren’t following a book, but we use books as references. I meant we aren’t following a recipe book.”

“Well, that’s splitting hairs.”

“Maybe. Anyway, we realized that the key thing to understand in order to get started with legacy remediation was how to break dependencies so that chunks of code could be isolated and covered by unit tests. It isn’t necessary to know everything about refactoring or about what clean code looks like to do that much. It turns out that just doing that much opens a lot of doors to further improvement.”
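To make that concrete, here is a made-up illustration of the kind of dependency-breaking involved: a calculation given a seam so that a unit test can stand in for the external system. The names are hypothetical; it is not code from our application.

    import unittest
    from unittest.mock import Mock

    # Made-up "before": the calculation is welded to its external system, so it
    # can't be unit tested without the real thing.
    #
    #     def price_order(order):
    #         rate = MainframeClient().fetch_tax_rate(order["region"])
    #         return order["subtotal"] * (1 + rate)

    # "After": the dependency is passed in, creating a seam a test can use.
    def price_order(order, tax_service):
        rate = tax_service.fetch_tax_rate(order["region"])
        return order["subtotal"] * (1 + rate)

    class PriceOrderTest(unittest.TestCase):
        def test_applies_regional_tax_rate(self):
            tax_service = Mock()
            tax_service.fetch_tax_rate.return_value = 0.10
            order = {"region": "EU", "subtotal": 200.0}
            self.assertAlmostEqual(price_order(order, tax_service), 220.0)

    if __name__ == "__main__":
        unittest.main()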

“And you’re seeing results already? That’s good. How are the metrics shaping up?”

“Well, we haven’t seen much change in defect density or cycle time as yet, but we feel as if we’re heading in the right direction based on the way the code base is evolving. We’ve already noticed less time being spent trying to chase down information about ambiguous requirements, thanks to putting testers and analysts together.”

“If the numbers aren’t improving, then what makes you think you’re on the right track?”

“It takes time for the numbers to improve. You know how it goes — any change results in a temporary reduction in performance. The metrics are serving another purpose: They are an early warning system in case we start to make things worse. That’s not happening, so we think that means we’re on the right track. Or at least, we’re on one possible good track.”

“How can you tell it’s the best possible track?”

“No one could possibly know that unless they tried an infinite number of alternatives. Since that’s not practical, and we’re interested in practical results, it’s irrelevant.”

“Can you simply declare something like that is irrelevant?”

“I guess I can, because I just did. Anyway, the cycle time metric has already helped us improve our short-term planning.”

“How so? You just told me it hasn’t changed much.”

“Right. But when we were doing our gap analysis based on our baseline measurements, we noticed that the cycle time to complete large work items was way out of line with those for small and medium-sized items.”

“It’s supposed to take longer to do a Big Thing than a Small Thing.”

“Sure, but we were seeing cycle times, respectively, of 3, 8, and 20 days. Something was out of whack with the way we were handling large-scale work items. It turned out that we weren’t decomposing larger work items properly. That’s one of the reasons we had resorted to fudging the definition of ‘done’ in order to get ‘credit’ for story points in each iteration. Today, we see those larger items as an indication that we need to discuss the requirements further. We rarely have anything larger than a ‘medium’ anymore. That’s already an improvement.”

“That makes sense. But you said you saved time by putting analysts and testers together early in the process. Didn’t that result in shorter cycle times?”

“Yes, it did, and that’s a great observation. Cycle time isn’t directly correlated with clean code. It’s just the best secondary indicator we could think of. If we discover a better way to measure maintainability, we’ll use it. Improving the continuous improvement process is part of the continuous improvement process.”

“Ha. That’s a mouthful. But I think I agree.”

“It wasn’t only the large work items, though. The medium-sized items were problematic, too.”

“Why? You said you were running two-week iterations. That’s 10 days. Ten is greater than eight.”

“A two-week time-boxed iteration contains 8 working days. The rest is taken up with process overhead.”

“Overhead? I thought you were using one of those magical ‘agile’ processes that doesn’t have any costs.”

We both laughed. “Every process has overhead. The question is whether you’re getting value in return for the cost of the overhead. When we have to play games with the definition of ‘done’, we aren’t really using the time-box as it’s meant to be used. We aren’t getting value from the overhead of the iteration planning activities.”

“So maybe you need to abandon the time-box model.”

“Maybe we will, when we’ve established the preconditions for success in doing so. For now, the next logical step for us is to learn to manage our work flow so that we can complete work items within the allotted time. One of the strengths of the time-box model is that the time-boxes drive us to improve our methods so that we can deliver incrementally and reliably.”

“In other words, you were smashing your thumbs when you fudged the definition of ‘done’ for large user stories…and it wasn’t the hammer’s fault.”

“Zackly. The best thing about this turned out to be the improvement process itself.”

“How so?”

“Well, you know what people say about double loop organizational learning?”

“Sure.”

“The empirical approach we came up with has proven to be a practical mechanism to build double loop learning directly into our normal work flow. And it’s pretty painless.”

“Yeah, we got some organizational learning out of our improvement effort, too.”

I smiled. “All you ‘learned’ was that TDD doesn’t work. And you ‘learned’ it by misunderstanding what happened. You’ve closed the door on using that practice in future. I wouldn’t call that a ‘win.’”

“Well, we tried,” he said.

I shook my head. “It reminds me of the story of the dog who was scared of cracks in the sidewalk.”

“What in the world are you talking about?”

“There was this dog, and one day when it was young it happened to be sniffing a crack in the sidewalk just at the moment a truck backfired nearby. For the rest of its life, it was afraid of cracks.”

“Very funny. For your information, I took the diagram back to the shop and showed the guys those preconditions you added. We’ve had a retrospective about how things went the first time around. We might go for TDD again.”

“I hope not.”

“What? One minute you’re for TDD, and the next minute you’re against it.”

“Not the point. I hope you’ll go for some specific improvement goal. If TDD falls out of that effort, or if something else does, it’s all good. Your problem the first time was that you went for a specific practice instead of a genuine performance goal. Your target should be an outcome, not an activity. When you pursue an outcome, practices will fall into place.”

“Right, right, I get it. Don’t worry. Getting back to your case: You don’t expect to see improvements in the numbers immediately?”

“Not for cycle time or defect density. The exception is customer satisfaction.”

“What do you mean?”

“Customer satisfaction survey results are following the opposite curve from the other measures. We just had our second survey, and the average response was 4.75 on a scale of 5. Our baseline measurement was 2 out of 5.”

“That must be due to the improvement in the automated voice response menu.”

“No, the help center hasn’t changed anything yet.”

“So, why are customers so happy?”

“At the risk of sounding like a comic book psychologist again, I’ll say it’s because they’re excited that someone is actually paying attention to their needs for a change. Sooner or later, things will settle back down to normal, and we’ll see responses that reflect their true level of satisfaction. That’s when we really want to see a 4 out of 5. Right now I think we’re just seeing early enthusiasm on their part. The shine will wear off once quotidian realities reassert themselves.”

“Quotidian realities, eh?”

“Those are the ones, yeah.”

“Of all the realities that might reassert themselves, it’s the quotidian ones that eat your lunch.”

“You said it. Speaking of lunch, here comes our food.”

3 thoughts on “Measuring continuous improvement”

  1. Shorter version in the form of a haiku

    Measure the outcomes
    And not the activities
    To know your progress.

  2. It is important that an aim never be defined in terms of activity or methods. It must always relate directly to how life is better for everyone.
    ~ W. Edwards Deming

    Quoted from while I’m translating it.
