The Curious Incident of the Magnetic Scotty Dogs in the Night-Time

Long, long ago, in February of 2020, Joshua Kerievsky of Industrial Logic fame published a blog post entitled, “A Tale of Two TDDers.” In it, he described a production issue which he said was based on real-world experiences in a real code base. The story featured two teammates, David and Sally, who took very different approaches to solving the problem.

Although based on a real incident, the story in this blog post is not the actual tale; it’s a contrived version in which David and Sally separately solve the same problem two different ways. David used a test-driven approach and took several hours to refactor a bunch of related code. Sally pushed a quick fix into production and the team immediately added the missing test case(s). Of course, in real life the team solved the problem just once. But Josh wanted to set up a question for readers: “Which programmer would you prefer on your team?”

The post generated quite a bit of enthusiastic discussion. So much, and so enthusiastic, that just 6 days later Josh published a follow-up, “One Defect, Two Fixes,” to try to clarify. He mentioned the first post had “generated a lot of interesting discussion, some of which bordered on deep misunderstanding.”

I must admit I had missed the point of the first post, and my comments on it went off on a tangent just as several others did. The follow-up post presented a straightforward description of the actual event, which was much clearer for me than the original presentation.

But the revised version still generated some heated discussion. Some of it became, maybe, too heated, and people who normally get along well started to misunderstand each other and miss each other’s points. I expressed dismay at this, and Ron replied with this pithy Tweet:

Hence the possibly non-obvious title of this piece. The reference isn’t so clear that I can assume everyone gets it, because the book in question isn’t so famous: The Curious Incident of the Dog in the Night-Time. And Josh’s story is about an incident. In the night-time. And then Ron’s tweet, right? Yeah. Don’t worry, I’m not quitting my day job.

I have three observations, or directions for potential further discussion, about the story.

Observation 1

The original version of the story had the flavor of a binary choice – either you abandon all good practices and hack up a solution in the interest of “speed,” or you stop the world and refactor everything. (My impression, anyway. That wasn’t actually the story at all. That impression was a misunderstanding on my part.) Which programmer would I prefer on my team? Well, neither, if those are really the only two alternatives. Come to think of it, I think my initial reply to Josh’s tweet about the post was the single word, “Neither.” I’m guessing that’s not what he meant by “a lot of interesting discussion.”

In subsequent versions of the story, Josh clarified that the question was inspired by an actual incident and he provided the details. The “real” version was much easier for me to relate to than the genericized version. I appreciated the fact Josh took the time to write it up.

My observation here is that we don’t really face binary choices like that. A “TDDer” wouldn’t think about doing the fix manually. He/she would know viscerally that a test-driven approach was likely to result in a fix in less time and with greater safety than a manual approach. We tend to depend on whatever methods we truly, deeply believe in when under stress, notwithstanding the practices we may espouse verbally when the sun is shining. We just don’t have time to do anything except what we “know” works.

On re-reading that paragraph, I realize it may sound critical. I mean it as an invitation to look in the mirror. A lot of us say we believe in certain practices, but what do we actually do when the heat is on? Do we depend on those practices to carry us through? Or do we depend on other practices when we absolutely have to get the job done? What practices do we really think are effective? Actions speak louder than words.

At XP Days Benelux 2009, Marc Evers and I facilitated a workshop entitled, “Things never change no matter what we do.” It was about using causal loop diagrams to understand complicated situations. The participants were 35 software developers who use Extreme Programming. We didn’t have a good mix of different roles and backgrounds. It was all developers, and they had a sort of Borg-like hive mind. The developers came to a point where none could think of any possible cause for technical debt besides “Management pressure to deliver.” It didn’t occur to any of them that their own tendency to set aside good practices when the pressure was on led to technical debt. I paused the workshop to make that very point: Whatever we actually do when under stress reflects what we deeply believe is most effective.

But wait…doesn’t TDD slow you down in the short term, with the promise of long-term benefit? If so, then it would slow you down during a bug fix activity, when you simply don’t have time to wait for a long-term benefit to materialize.

I remember participating in an Elephant Carpaccio exercise at an agile conference several years ago, facilitated by Alistair Cockburn, who invented the exercise. The idea is that we complete a very simple programming problem by slicing the requirements as thinly as possible, working in 9-minute iterations.

At that time, Alistair said some 400 people had gone through it. By 2020, thousands of people must have experienced it. Here’s a nice explanation by Henrik Kniberg: Elephant Carpaccio Facilitation Guide

People mostly work in pairs unless there’s an odd number of participants. In my case, I was in a group of three. We had everything going against us: Different languages/countries, different age cohorts, different sexes, different backgrounds. We didn’t know each other. We used a guy’s Mac laptop with a custom keyboard layout; the other two of us weren’t familiar with Macs.

Participants could approach the exercise any way they wished. In general, teams either hacked up a solution fast, or they followed some XP practices but ran out of time. Our team followed TDD and pairing rigorously and also completed the exercise. We were the only team to do so.

In the debrief, Alistair suggested that TDD might not add value over higher levels of testing, as it tended to slow things down. On Henrik’s site, comments on the facilitator’s guide tend to agree:

“…when we did the EC exercise we soon abandoned TDD. It slowed us down too much.”
“In a 40-minute programming exercise even TDD experts can’t regain the investment. Most people I meet say the payoff time for TDD is in the range of weeks, which matches my experience too.”
“…you can’t run TDD in this session as it’s too short.”

Well, it doesn’t match my experience, FWIW, either during the exercise or in the Real World®. Statistically, Alistair’s interpretation of the results of our session was correct – only one team both followed good practices and completed the exercise. Up to that time, it was the only team from all such sessions that did so. That’s statistically insignificant; an outlier. The general result was that TDD was too costly to work with 9-minute iterations.

Socially or culturally, the result is that people assume they can’t rely on good practices when the pressure is on. That’s when trouble comes.

I think the outlier offers an interesting insight. Why didn’t that one team experience the same problem as others who tried to apply TDD in a short timeframe? What was different about them? My observation is that when programmers gather together for a workshop or a code dojo or similar, they tend to burn up a lot of time (initially, at least) talking about and arguing about good practices.

In May of 2010, I had the honor to participate in the first Certified Scrum Developer class, conducted by Ron Jeffries and Chet Hendrickson on the LeanDog boat in Cleveland, Ohio. It was a kind of dry run or maiden voyage, not a normal class, and students included such luminaries as Jon Kern, George Dinwiddie, Adam Sroka, Jeff Morgan, and others. It was a classic case of too many chefs and not enough cooks. We wasted a great deal of time discussing good practices instead of applying them. We got very little work done. Even practitioners at that level of skill are prone to this. I’ve seen the same behavior in just about every code dojo and similar event I’ve ever participated in. Maybe it’s a developer thing.

Going back to the Elephant Carpaccio exercise, our team of 3 applied the practices; we didn’t compare individual flavors of TDD or pairing. We didn’t get into circular debates about the finer points of XP. We just steadily moved along step by step without rushing and without pausing. We were able to have brief design discussions and even sketch out ideas on paper. We absolutely followed classic-style TDD to the letter. We paired, with the third person acting as Product Owner, and switched roles after each iteration. We completed the exercise just in time, with no time to spare, but also with no stress. The key is to do the practices rather than to jabber on and on about personal preferences regarding naming conventions or when to break out classes and so forth. JFDI!

So, I have to wonder what was going on with the other teams in the session. Did they do the classic “developer thing” and chew up their 9 minutes in discussion, or did they really apply XP practices? I think our team’s result shows that there’s nothing inherent in the practices that makes them unsuitable under time pressure.

And does it take weeks to realize the benefits of TDD? I can’t question the experiences of others; I’m not them. But they can’t question mine, either, and I can say I’ve seen benefits within minutes when showing teams how to start refactoring code and test-driving changes. TDD doesn’t actually slow things down. Indulging in circular debates about TDD slows things down. And software developers love circular debates, especially about matters of world-altering importance like indentation or curly brace placement. They will fall into such debates even when they’re doing 9-minute iterations in an Elephant Carpaccio workshop.

Anyway, the choice facing the Industrial Logic team wasn’t whether to write tests or not, or whether to refactor or not. Those are not open questions for XP practitioners. The choice was about scope – how far do we need to go with additional test cases and refactoring to accomplish the fix quickly, cleanly, and safely? Clearly, there was no need to stop the world and refactor everything. There is usually a practical and sensible middle ground between the two extremes.

Observation 2

Given that we have a team proficient in TDD and other good practices, what led them to set those practices aside to accomplish this fix? From one of Josh’s articles:

“Those of us who develop the Greatest Hits product primarily pair or mob program, use Test-Driven Development (TDD) to produce most code, refactor liberally, check in to master (we don’t use branches) and deploy/release to production continuously. Our code base has high test coverage, including thousands of valuable, high-speed microtests, over a hundred acceptance tests, under a hundred Selenium tests and, for our continuous deployment safety, a few deployment tests. Yet even with all of this safety, we get the occasional defect. And that brings us to our story.”

This is all good stuff, and in fact I think it brings us to another story, as well. That other story is that we who use TDD extensively tend to write mostly example-based test cases, often including data-driven or parameterized cases, which are a flavor of example-based cases. The main pitfall with example-based cases is that we’re likely to overlook an edge case or exception condition here and there. As it isn’t possible to predict everything that might happen in production, there’s always a risk of unpleasant surprises. And yet…

“It’s also important to note that the first fix wasn’t simply hacked into the code and checked in with a ‘hope and pray’ strategy. Manual testing was done to first replicate the problem locally.”

Okay, let’s pause here and ask, with curiosity and not criticism, why a team that normally emphasizes TDD would approach the problem in this way. I wasn’t there and didn’t live through the incident with them, so I can’t criticize. But I am curious. Here’s why:

From the time I first learned TDD, I remember people saying that a production defect is an indicator that we’re missing one or more test cases. The process of reproducing the reported behavior locally is to write the missing test case(s) and see that they fail due to the bug. That is how you reproduce the behavior locally. You don’t spend as much time, or more, to “test manually” and then decide whether you have spare time to add the missing cases.

(Of course, as you don’t have a test case to point you to the source of the error, you’ll have to spend some time narrowing it down. You might use a technique like the Mikado Method or core-and-slice to get there, but that time doesn’t count against “TDD time,” as you haven’t yet gotten to the point of writing the fix.)
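To make the idea concrete, here is a minimal sketch of what “reproduce it by writing the missing test” might look like. It is not Industrial Logic’s code. Only setBookmarkFor, pageKey, and albumId are names taken from Josh’s article; the BookmarkStore class, its behavior, and the choice to silently ignore a null albumId are assumptions made for illustration. A real project would use JUnit; a plain method keeps the sketch free of dependencies.

```java
import java.util.HashMap;
import java.util.Map;

// Run with: java BookmarkReproduction.java
class BookmarkReproduction {

    // Written first. Without the albumId check in setBookmarkFor it fails
    // with a NullPointerException, which reproduces the production defect.
    static void nullAlbumIdIsIgnored() {
        BookmarkStore store = new BookmarkStore();
        store.setBookmarkFor(null, "page-1");
        if (store.size() != 0) {
            throw new AssertionError("a bookmark was stored for a null albumId");
        }
    }

    public static void main(String[] args) {
        nullAlbumIdIsIgnored();
        System.out.println("reproduction test passes");
    }
}

// Hypothetical stand-in for the production class.
class BookmarkStore {
    private final Map<String, String> pageKeyByAlbum = new HashMap<>();

    void setBookmarkFor(String albumId, String pageKey) {
        if (pageKey == null) return;  // the check that already existed
        if (albumId == null) return;  // the fix, added only after the test above failed
        pageKeyByAlbum.put(albumId.trim(), pageKey);
    }

    int size() {
        return pageKeyByAlbum.size();
    }
}
```

The order of work is the point: the test comes first and fails, the one-line fix makes it pass, and the missing case stays in the suite afterward.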

The point is we don’t have time not to do it this way. This is true even when under pressure to get customers up and running in production. Come to think of it, especially then. It’s part of the discipline of this approach to the work. Reading on:

“Next, the code was changed and more manual testing was done to see that it fixed the issue. Finally, all tests were run to ensure that nothing had been broken. The fix was live soon after the check-in, since our continuous deployment pipeline pushes all changes that pass our build (including our large suite of tests) automatically to production.”

It seems to me that with that sort of pipeline in place, it’s all the more feasible to test-drive the fix and let the standard process carry the code through to production with all the usual safety checks. It would be safer and probably faster than deviating from the process to test the fix manually. And if the manual approach were faster but introduced a regression, then it wouldn’t matter that it was faster.

Josh said they ran the whole test suite and it passed. That’s the same test suite that missed the first bug, right? Hmm. But as I said, I wasn’t there and I don’t know if there were other factors that led the team to act as they did.

Josh is frank about the fact the team isn’t always as rigorous about good practices as they’d like to be. In explaining how the bug got created in the first place, he writes:

“I was not happy about this new null check in the setBookmarkFor(…) method! We already had a null check for pageKey. Adding that new null check was not even part of refactoring, since it’s new behavior. It also appears that we were not very diligent that day about test-driving that null check into the code, as there is no new test for it, which is quite rare for us.”

And the commit comment for the fix: “Added back a null check on AlbumId. We’ll need better test coverage for this.” Not to belabor it, but I think that had they reproduced the error by writing the missing test case, they would have had an easier time of it.

This isn’t a criticism of them, but rather a learning opportunity for other teams. It may even be encouraging for other teams, as very few others in the world are as skilled in these methods as Industrial Logic. If even they can make a mistake like this, then it’s certainly understandable that others could make mistakes, too. It’s nothing to be ashamed of.

But what good is a mistake if we miss the opportunity to learn from it? I think there’s more to learn than the original question Josh posed.

Observation 3

There have been some interesting advancements in development practices since the time of that incident. The way the team normally works or worked, as Josh describes it, is all good: the extent of their test suite, the use of continuous integration, trunk-based development, and automated deployment. But with the growth of service-based and microservice-based solutions and elastic cloud infrastructures, it has become impractical (if not impossible) to test everything before going live. With that in mind, there are a few things we might do in addition nowadays. And it turns out we can do these things for all kinds of software; not just microservices. They include:

Design our solutions to be observable, and monitor production proactively using a tool such as Honeycomb. In fairness, this was not available at the time of the incident. I mention it as a potential lesson going forward.
Employ a phoenix server strategy to avoid configuration drift and to reduce opportunities for malware to be introduced on production instances. It’s fair to note this was not a “thing” in 2012.
Test in production using one or more techniques such as canary deployment or A/B deployment.
Use an external service to poke at the production site more-or-less constantly, providing insights the team can use to improve the product without taking up a lot of their own time.
Write property-based tests to supplement our example-based tests. In the case Josh reported, a property-based test would very likely have exposed the defect before deployment: tools like junit-quickcheck and jqwik generate edge cases such as empty strings, and can be configured to inject nulls as well (Josh shared Java source code, so I mention Java-based tools here). It’s fair to note that this category of tools was not mainstream at the time of the incident.
Use mutation testing prior to deployment. There’s a mature tool for Java called PIT. This tool was mainstream at the time of the incident. This type of testing may or may not have exposed the defect prior to deployment.
Practice whole-team Exploratory Testing regularly, such as weekly or biweekly. Josh mentioned the team often used mob programming and pair programming, so this would be a natural fit for their work flow. I know the people involved and they are familiar with this practice. (It’s possible they were doing it and it just didn’t come up in the story. It wouldn’t necessarily catch a bug of this kind, but it might.)
Take a cue from our mutual acquaintance Arlo Belshee, and practice read by refactoring routinely. Over time (and not very much time, actually) hard-to-understand sections of code become easier to understand, better-named, and easier to modify (or “fix”) safely. As developers spend more time reading code than writing it, improvements in code readability have a greater pay-off than improvements in coding speed. In fairness, read by refactoring was not a “thing” at the time of the incident.
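As a rough illustration of the property-based item in the list above, here is a hand-rolled sketch in plain Java. It deliberately does not use a real property-based testing library, so it runs without dependencies; a real tool adds shrinking, better generators, and reporting. The Bookmarks class, the guardAlbumId switch that mimics the defect, and the property itself are assumptions for illustration, not the actual Greatest Hits code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

class PropertySketch {

    // Hypothetical code under test; guardAlbumId = false mimics the defect.
    static final class Bookmarks {
        private final Map<String, String> pageKeyByAlbum = new HashMap<>();
        private final boolean guardAlbumId;

        Bookmarks(boolean guardAlbumId) {
            this.guardAlbumId = guardAlbumId;
        }

        void setBookmarkFor(String albumId, String pageKey) {
            if (pageKey == null) return;
            if (guardAlbumId && albumId == null) return;
            pageKeyByAlbum.put(albumId.trim(), pageKey);
        }

        boolean isEmpty() {
            return pageKeyByAlbum.isEmpty();
        }
    }

    // Generator: roughly 10% null, 10% empty string, otherwise a short random word.
    static String arbitraryString(Random random) {
        int roll = random.nextInt(10);
        if (roll == 0) return null;
        if (roll == 1) return "";
        int length = 1 + random.nextInt(8);
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < length; i++) {
            word.append((char) ('a' + random.nextInt(26)));
        }
        return word.toString();
    }

    // Property: setBookmarkFor never throws, and it stores a bookmark
    // exactly when both arguments are non-null.
    // Returns a description of the first counterexample, or null if none was found.
    static String findCounterexample(boolean guardAlbumId, long seed, int tries) {
        Random random = new Random(seed);
        for (int i = 0; i < tries; i++) {
            String albumId = arbitraryString(random);
            String pageKey = arbitraryString(random);
            String input = "albumId=" + albumId + ", pageKey=" + pageKey;
            Bookmarks bookmarks = new Bookmarks(guardAlbumId);
            try {
                bookmarks.setBookmarkFor(albumId, pageKey);
            } catch (RuntimeException e) {
                return input + " threw " + e.getClass().getSimpleName();
            }
            boolean expectStored = albumId != null && pageKey != null;
            if (bookmarks.isEmpty() == expectStored) {
                return input + " stored=" + !bookmarks.isEmpty();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println("defective version: " + findCounterexample(false, 42L, 1000));
        System.out.println("fixed version:     " + findCounterexample(true, 42L, 1000));
    }
}
```

Nobody had to think of the null albumId case in advance; the generator stumbles onto it. That is the gap example-based suites tend to leave open.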

Conclusion

Others may see different things than I in this incident and the way the discussion played out, but here are my key take-aways:

Lesson 1: Don’t take feedback personally

It’s useful to examine incidents like this one to see what we can learn to improve our work and to improve the general state of the art of software development. To do that, it helps if we can resist the natural human tendency to take negative or apparently-negative feedback personally.

Each of us has a personality and a unique way of expressing ourselves, and sometimes that can be taken as a personal affront to others who have different personalities and different communication styles. It can be hard to do, but let’s try to keep that in mind so we can learn from each other’s experiences.

Remember the second of Miguel Ruiz’ Four Agreements, which reads, “Don’t take anything personally.”

Lesson 2: Be frank (depends on Lesson 1)

Historically, programmers have a well-earned reputation for arrogance and snarkiness. They also tend to be sensitive and thin-skinned. The developer community has tried to correct this deficiency in tech culture for many years.

Norman Kerth’s Retrospective Prime Directive expresses an idea that is stated frequently in various ways: “Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.”

This is the right approach in my view, but I’ve noticed the pendulum has swung so far in the opposite direction that people hesitate to criticize anything at all, or even to ask why things were done in a certain way. Literally anything anyone does is just fine, because everyone is a Good Person®.

Well, you know, people have been modifying software without tests for decades, and they’ve been quite successful. Have they? What’s the general state of software quality in the world today?

It’s important to respect each other and to remember our work doesn’t define us as humans. It’s also important to be able to speak about things that bear improvement, without worrying that the feedback itself will become a target of criticism. When we attack the feedback, it tends to shut people down and we lose the opportunity to learn from their observations and insights.

Indeed, a central element in Industrial Logic’s own High Performance via Psychological Safety pamphlet is “Invite Radical Candor.” It’s right in the center of Page One and is the only concept that’s presented in a colorful graphic. Doesn’t that suggest it’s important?

Lesson 3: Depend on good practices when under pressure

The assumption that core XP practices such as TDD only “work” under ideal conditions is, unfortunately, very widespread. It can lead to cases when people set aside good practices under pressure, as in this incident. Even highly skilled practitioners are susceptible to this.

When results of exercises like Elephant Carpaccio are misunderstood, the misunderstanding feeds this general assumption. I say “misunderstood” because most people assume the poor results of using TDD with the 9-minute iterations are caused by TDD itself.

My interpretation is the poor results are caused by developers acting like developers. We are human, after all, and we enjoy talking about topics of mutual interest while we work. When you pare down the iteration length to 9 minutes, there’s no slack time for socializing. I suspect that is the reason most participants in the exercise have difficulty applying good practices under the time constraint.

The previous point about developers acting like developers doesn’t tell the whole story about what (probably) happens during Elephant Carpaccio sessions. Referring to the work of Michael “GeePaw” Hill, it seems most TDD practitioners don’t quite grasp the idea of small.

Elephant Carpaccio makes us slice the “stories” very, very thinly. It seems likely that programmers doing the exercise have to spend some time figuring out how to slice the stories small enough for the short iterations.

Practitioners who normally slice the work thinly would be in a better position to JFDI. For others, the friction they experience in trying to slice thinly could fuel the general assumption that TDD “takes too long.” So, maybe a lesson here is to be careful not to confuse correlation with causation. At least, don’t take that assumption back to work with you and start skimping on good practices.

Lesson 4: Pragmatism before purity (except after C)

Object-Oriented programmers strongly dislike null checks in their code. Ideally, it’s possible to design solutions that don’t produce null references. People try to do this with Java, but Java persists in being Java.

I think this incident reminds us that when we’re working with tools that can generate null references, we must check for them, per Murphy’s Law. NullPointerExceptions have bitten Kotlin programmers who assumed they didn’t have to worry about them, as Kotlin is null-safe by default. But it’s the existing libraries that make the JVM a powerful platform for applications, and those libraries were written in JVM languages that are not null-safe.

In this case, the source language was Java, which is quite the opposite of null-safe. Both the libraries used and the custom application code could generate null references. It was only a matter of time before one popped up in production. Even if null checks feel itchy, let’s remember the characteristics of the tools we’re using and take them into account.
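One way to take those characteristics into account is to check at the boundary where a null can enter, so the rest of the code doesn’t have to. The following is a minimal sketch of that idea, not anything from Josh’s code base; AlbumCatalog and titleFor are hypothetical names, and Map.get stands in for any library call that may return null.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

class AlbumCatalog {
    private final Map<String, String> titlesById;

    AlbumCatalog(Map<String, String> titlesById) {
        // Fail fast at construction rather than later, deep inside a request.
        this.titlesById = Objects.requireNonNull(titlesById, "titlesById");
    }

    // Map.get returns null for a missing key. Returning Optional puts the
    // "nothing there" case into the signature so callers cannot overlook it.
    Optional<String> titleFor(String albumId) {
        if (albumId == null) {
            // Guard first: immutable maps created by Map.of throw on get(null).
            return Optional.empty();
        }
        return Optional.ofNullable(titlesById.get(albumId));
    }

    public static void main(String[] args) {
        AlbumCatalog catalog = new AlbumCatalog(Map.of("a1", "Greatest Hits"));
        System.out.println(catalog.titleFor("a1").orElse("(unknown album)"));
        System.out.println(catalog.titleFor(null).orElse("(unknown album)"));
    }
}
```

The null check is still there, itchy or not. It just lives in one place instead of being scattered through the callers.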

Lesson 5: Keep up

Industrial Logic has been at the forefront of software development practice since its inception. The incident Josh reported dates from 2012. Those who work elsewhere: Notice the list of practices Josh enumerated in his article. How many of your teams are using even half those practices today, in 2020? The chances are good that your team and your software organization are far behind that standard.

That was the good news. The bad news is that 2012 is long gone. A lot has happened since then. Your target should not be to work as Industrial Logic did in 2012. Your target should be to work as a contemporary software team should work today, with an eye toward tomorrow.

Consider the items listed under Observation 3 above. Several of them did not exist or were not mainstream circa 2012. Today, they are almost baseline expectations for professional software teams. They will certainly be assumed practices by 2025.

When Charity Majors created Honeycomb, she didn’t only create a software product, she also created a new category of software that supports an extended range of responsibilities for software teams. Today, development teams aren’t limited to throwing code into production. They are also responsible for customer/user happiness in production. Honeycomb, and several new products it has inspired, enable proactive operations support, as opposed to reactive support. To do that, application code must be instrumented for observability. That introduces new software design principles beyond those that we applied in 2012 and before.

In Josh’s description of the 2012 incident, Tim immediately responded when a customer was dead in the water due to a null reference. He quickly got things straightened out. We may debate whether he could have or should have taken a test-driven approach to the fix, but on a more fundamental level the team was responsive to customer needs in production. In that regard, they were ahead of the times in 2012.

But they were still reactive; they didn’t do anything unless and until a production issue was reported. Today, it’s becoming expected that product-focused teams are proactive in managing operations. The mode of work described in the incident report is no longer state-of-the-art. Things don’t stand still in this industry.

There is no perfect solution to the problem of bugs. As we move complexity out of application modules and into the infrastructure, we make it easier to test each module thoroughly before release, but we also make it easier for unexpected things to happen in production. Look up “gray failures,” for example. We need to adjust our work practices and tooling to account for this reality.

Lesson 6: Do All the Things®

In his 2014 commencement speech at the University of Texas, Admiral William H. McRaven describes 10 life lessons we can take from Navy SEAL training. The number one lesson is that “if you can’t do the little things right, you’ll never be able to do the big things right.”

He tells the story of how they were required to make their beds perfectly every day. The candidates thought this was silly, as they were training to become warriors. In the end, they understood the point of it.

A similar idea applies to our own work. There’s a huge pay-off when we exercise self-discipline in the small things.

Perform every task, every time, with the highest skill you can.
Don’t assume bugs are an inevitable fact of life; don’t tolerate bugs; follow Arlo Belshee’s advice about “bug zero.”
You need more than a suite of example-based functional checks, even if your suite reports high line coverage numbers. Learn and apply property-based testing and mutation testing.
Don’t be afraid to touch code; make it easy for others to scan the code and understand it; follow Arlo Belshee’s advice about relentless refactoring.
Make test suites easy to read and expressive of the “story” of the application; follow Kevlin Henney’s advice about structuring test suites and naming test cases.
Working in the small yields benefits in the large; be serious about it; follow GeePaw Hill’s advice about microtests.
Reduce the complexity of coordinating code changes; follow Paul Hammant’s advice and do trunk-based development.
Learn testing fundamentals and start doing Exploratory Testing sessions regularly.
Build your code so that it’s possible to see what’s going on in production and drill down into potential problems before customers are affected.
Trust well-vetted technical practices to carry you through the tough times; don’t assume they “won’t work” under time pressure.
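For readers who haven’t met mutation testing, mentioned in the list above, here is the idea reduced to a hand-made sketch. A mutation testing tool generates the mutants and runs your suite against them automatically; this example does both by hand. All the names are hypothetical, and the “mutant” is simply the rule with the albumId check deleted.

```java
import java.util.function.BiPredicate;

class MutationSketch {

    // The rule as intended: a bookmark needs both an album and a page.
    static final BiPredicate<String, String> ORIGINAL =
            (albumId, pageKey) -> albumId != null && pageKey != null;

    // A mutant: the same rule with the albumId check removed.
    static final BiPredicate<String, String> MUTANT =
            (albumId, pageKey) -> pageKey != null;

    // Example-based suite that never tries a null albumId.
    static boolean weakSuitePasses(BiPredicate<String, String> canBookmark) {
        return canBookmark.test("album-1", "page-1")
                && !canBookmark.test("album-1", null);
    }

    // The same suite plus the missing case.
    static boolean strongSuitePasses(BiPredicate<String, String> canBookmark) {
        return weakSuitePasses(canBookmark)
                && !canBookmark.test(null, "page-1");
    }

    public static void main(String[] args) {
        // A mutant "survives" when the suite still passes against it.
        System.out.println("weak suite, mutant survives:   " + weakSuitePasses(MUTANT));
        System.out.println("strong suite, mutant survives: " + strongSuitePasses(MUTANT));
    }
}
```

A surviving mutant means the suite cannot tell the difference between the code with the check and the code without it, whatever the line coverage number says.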

Cutting corners with respect to good practices in the interest of “speed” doesn’t help with this. It only allows subtle bugs to hide within application modules, to be discovered when edge cases occur in the wild. It enables Hyrum’s Law to come into play; a clever client exploits an unintended behavior of your API, and forevermore you can’t refactor the code behind it because doing so will scuttle a customer.

And be grateful when someone offers constructive feedback, even if their personal communication style doesn’t please you.