Throughout my career, people involved with software development and support have done their best to make software perfect; or at least to eliminate all the bugs from it. We’ve come up with more and more design principles, development techniques, testing methods, system monitoring and self-correction schemes, and tools aimed at avoiding, preventing, detecting, working around, recovering from, and removing bugs from software.
It doesn’t seem to be happening. No matter what we do, our software still has bugs. There are two main issues, I think: (1) We don’t know in advance when a bug will make its presence known, and (2) some bugs cause an extreme amount of damage or result in high, unplanned costs. And there are two general approaches we can take to handling those realities: (1) We can ensure no software contains any bugs, or (2) we can ensure our systems and our lives can recover from the damage caused by bugs at reasonable cost and in reasonable time.
Until now, we’ve mostly been taking approach (1). The result is a world in which software is ubiquitous and unavoidable, and in which hardly anything works entirely correctly the first time we try to do it, or the majority of times we try to do it, or maybe ever. High-tech cars suddenly decide the throttle should be fully open and the brakes disabled. Websites take users in a circle, never giving them the option they’re looking for, and no fall-back customer service phone number is provided. First World problems? Sure. Necessary problems? Not so sure. One of the promises of technology is that it will make our lives easier. Instead, everything we do that depends on technology turns into a dance…two steps forward and one step back, again and again. Often, the only way to get something done is to drop out of the dance and step around the technology.
One example among an endless series: At the supermarket the other day, I joined a check-out line (that’s a “queue,” for you Brits and similar out there) 20 people deep. Only one of the 15 available manned registers was open. The four self-service check-out lanes were available, but no one was using them. Store personnel approached customers standing in line and offered to help them through the self-service check-out procedure. That procedure is so counterintuitive, and the reliability of the systems so low, that no one wants to use them. Encouraging customers to use the self-service lanes by deliberately under-staffing the manned lanes hasn’t worked. People would rather spend 30 minutes in line than deal with the damned frustrating machines. That includes customers like me, who work with damned frustrating machines all the time, and ought to be used to it.
Bugs, proof, and the impossible
It occurs to me there’s a built-in flaw in everything we’ve been doing all these years. People have been applying the same skills, assumptions, mindsets, methods, and tools to removing bugs as they used when they created them. It’s a familiar pattern in human activity: “If X isn’t working, then do more X, and do it harder!”
Colleagues who specialize in software testing frequently remind us that it’s impossible to prove any software works. As the software world grows ever more complicated, it becomes ever harder to avoid, detect, and eliminate literally every conceivable bug. A solution that comprises, say, fifty million autonomous AI-enabled nanobots that can self-organize into mini-swarms and self-configure to deal with problems their designers never anticipated, capable of discovering and connecting to any other devices within radio range whose communication protocols they can decipher and mimic, all deployed on a collection of interlocking networks, with most of those devices having been developed and deployed in a rush, under schedule and budget pressure, programmed mostly by people who don’t apply sound software engineering methods and who steadfastly insist they don’t need to learn sound software engineering methods because they’re too cool for that, in a world populated by highly skilled hackers bent on stealing, compromising, or ransoming anything and everything they can break into, will naturally be orders of magnitude harder to secure and verify than, say, a typical monolithic, centralized business application of 1970s vintage, running on a mainframe behind locked doors and not connected to an outside network. The impossible is becoming impossibler by the minute.
Making the impossible possible. Or not.
When trying to achieve a seemingly impossible goal, humans may look for ways to change the world around them to make the goal possible after all. No doubt the majority of humans alive in 1962, when President Kennedy announced the intention to send a man to the moon and bring him safely home by the end of that decade, would have dismissed the idea as “impossible.” Yet, when sufficient money, talent, effort, and creativity were applied to the challenge, humans redefined the word “impossible” in that context. Build a long enough tube and put enough rocket fuel in it, and it can “slip the surly bonds of earth” (whatever that means). Well, there might be one or two other details involved, too, but you get the idea.
But not every impossible thing can be made possible. When we are pretty sure a thing really is impossible, then the logical course of action is to set our sights on some other thing. You can beat your head against a wall only until one or the other breaks. If the wall gives way, then it’s up to you to decide whether the outcome was worth the headache. But at least you know it was possible. Good on you!
Once you’ve learned the wall is harder than your head, it’s time to go beat your head against something else. If you continue to beat your head against the same wall after that, it’s a personal problem. For whatever reason, it seems to be difficult for humans to recognize when they’ve reached that point. They just keep doing more X, and harder.
The impossible gets in the way of Good Enough
Let’s revisit something I mentioned earlier: Colleagues who specialize in software testing frequently remind us that it’s impossible to prove any software works. Of course, that statement implies an extreme, philosophically pure definition of “proof” and an extreme, philosophically pure definition of “it works.” On a practical level, nothing (software or anything else) really has to be absolutely perfect under all conditions; it just has to meet a need, preferably without annoying the user in the process. Similarly, it isn’t necessary to “prove” mathematically that a thing (software or anything else) meets all needs in all contexts for all people; it just has to meet one need for one person in one context, or at least to appear to meet that need from the perspective of that person, for the duration of the task in which they are using the thing.
A software tester might tell me, “That Android app you wrote appears to do what you intended it to do, but watch what happens when I crush your phone in an industrial press. There, you see? The app stopped working. That goes to show programmers don’t think about edge cases.” To which I would reply, “You’re paying for that phone. You know that, right?” And the tester would say, “What? You don’t have insurance? See, another edge case you didn’t think of! A good lesson for you, well worth the price of the phone. You’re welcome!”
The head submits to the inexorable wall
The phone crushed in an industrial press is what happens when we pathologically seek philosophical perfection without considering context. If, instead, we accept the imperfection of software as a part of Nature, then we have a couple of choices.
First, we can give up. We can stop using software. The ancient Persians, Egyptians, Mayans, Greeks, Romans, Chinese, Arabs, Indians and probably many others did fairly well without software. They achieved remarkable feats of engineering and architecture, and understood advanced mathematics and astronomy. Evidence has been found of things like batteries, capacitors, large-scale steam-powered and water-powered engineering works, and much more. The Antikythera mechanism is an ancient Greek computer – an analog one, without software. Leonardo da Vinci conceived and designed machines like submarines and helicopters without using any software. Effective financial accounting has been done using knots in string and impressions in clay. People have constructed clockwork, electrical, or magnetic devices for centuries, for purposes ranging from instilling awe in the faithful to navigating across the oceans to casual entertainment, all without software. When the Apollo 13 astronauts had to adapt to unforeseen events and ride home in the Lunar Excursion Module still attached to the Command Module, they calculated the revised corridor burn using a slide rule.
Seems to me we could do quite well without software. I only say so because we have done quite well without software. Entire civilizations have been built without software. In contrast, software hasn’t built any entire civilizations.
But I suspect it isn’t very likely humanity will stop using software; that people the world over will wake up one fine day and collectively decide to shut it all down once and for all, to say goodbye to the self-service check-out lanes at the supermarket and re-learn how to drive their own cars, and make all those professional hackers go out and find real jobs. History doesn’t appear to offer any repeating patterns of that kind.
That leaves us with a second choice: Step back from the wall of trying to make perfect software, and start banging our heads against the wall of trying to make software resilient and the damage it causes recoverable. We know our software will always be flawed in ways that are difficult for us to anticipate but easy for professional hackers to exploit, or for algorithms to discover by brute force. We accept that reality. If we can’t avoid it, the question becomes how do we live effectively with it?
-ilities for the modern world
Wikipedia’s list of system quality attributes (what we sometimes call -ilities) included 82 entries as of this writing (25 Jan 2019). Of the three qualities I consider crucial for beating our heads against the new wall, only one is present in that long list: resilience. The other two are observability and replaceability. By focusing on those three -ilities, we can aim to build software that has a chance to survive the effects of its own imperfection as it operates in a world that is at best indifferent and at worst hostile to our design intent.
According to ResiliNets, a wiki focusing on network technology, resilience is defined as “the ability of the network to provide and maintain an acceptable level of service in the face of various faults and challenges to normal operation.” I like this definition. Given today’s world, where the boundaries between things like networks, servers, and applications aren’t what they used to be, I would extend the concept to all components of a software environment and the overall environment itself.
Now, if we intend to beat our heads against the Wall of Resiliency instead of the Wall of Perfection, there are certain implications for our work. When we’re wearing our Programmer hats, we want to design network architectures, solution architectures, and software applications with resiliency as a core design goal. That means we don’t want to throw exceptions all over the place without taking any corrective action or even logging anything useful (that’s the status quo in business application software today, by the way). When we’re wearing our Tester hats, we want to probe systems’ ability to “maintain an acceptable level of service” no matter what we throw at them. We aren’t (only) looking for functional errors in the application logic; we’re looking for the system’s ability to survive the unexpected in whatever form it comes.
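To make the “corrective action plus useful logging” point concrete, here’s a minimal sketch in Python. Everything in it — the function, the currency-rate scenario, the fallback value — is illustrative, not from any particular system. Instead of letting an exception fly with no context, the code retries transient failures, logs enough detail to diagnose the problem later, and degrades gracefully:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("orders")

def fetch_exchange_rate(currency, fetch, fallback_rate=1.0, retries=3):
    """Fetch a rate via the supplied callable, retrying transient failures
    and degrading gracefully instead of dying with an unlogged exception."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(currency)
        except ConnectionError as exc:
            # Log actionable context, not a bare stack trace.
            log.warning("rate fetch failed: currency=%s attempt=%d error=%s",
                        currency, attempt, exc)
            time.sleep(0.1 * attempt)  # simple linear backoff
    # Corrective action: fall back to a cached or neutral rate
    # so the surrounding workflow can continue at reduced fidelity.
    log.error("rate service unavailable for %s; using fallback %.2f",
              currency, fallback_rate)
    return fallback_rate
```

The point isn’t the retry loop itself; it’s that every failure path takes some deliberate action and leaves behind evidence a human can act on.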
There is a risk in focusing single-mindedly on resilience. It’s quite possible to build systems that recover quickly from anomalous events, hardware failures, spikes in demand, and so forth, and that can recognize patterns that suggest fraudulent usage or denial-of-service attacks. But when we do so, sometimes our systems are too good at recovering themselves or working around problem spots, and not so good at helping us ferret out the cause. That leads to gray failures: “We define gray failure as a form of differential observability. More precisely, a system is defined to experience gray failure when at least one app makes the observation that the system is unhealthy, but the observer observes that the system is healthy.”
In case it isn’t already clear, the “one app” is you attempting to do online banking, and the “observer” is the part of the system responsible for keeping the online banking application up and running. It’s up and running, all right. It’s in a hard loop. It’s resilient as hell. Nothing can kill it.
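Here’s a toy sketch of that “differential observability,” with entirely made-up names: a liveness probe that only checks whether the process responds reports “healthy,” while the operation the user actually cares about fails.

```python
# A toy "gray failure": the system's own health probe and the user's
# experience disagree. All names here are illustrative.

class BankingService:
    def __init__(self):
        self.process_alive = True   # what the watchdog can see
        self.db_connected = False   # what actually matters to users

    def liveness_probe(self):
        # The observer's view: the process responds, so "healthy."
        return self.process_alive

    def transfer(self, amount):
        # The app's view: the operation the user cares about fails.
        if not self.db_connected:
            raise RuntimeError("backend unreachable")
        return amount

svc = BankingService()
observer_says_healthy = svc.liveness_probe()   # True: resilient as hell
try:
    svc.transfer(100)
    app_says_healthy = True
except RuntimeError:
    app_says_healthy = False                   # the user sees a dead app

# Gray failure: observer_says_healthy != app_says_healthy
```

The watchdog will keep the process running forever; nothing it measures tells it the users are getting nowhere.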
Come to think of it, killability or disposability is missing from the Wikipedia list of system qualities, too. It’s one of the canonical 12 Factors for designing cloud apps. But that’s a lower level of design detail than the three key attributes I want to mention in this piece. For our (my) immediate purposes, the problems that can arise from a focus on resiliency lead directly to the next system quality of interest: observability.
The concept of observability comes from Control Theory. According to Wikipedia, it’s “a measure of how well internal states of a system can be inferred from knowledge of its external outputs.” As cloud-native software became mainstream, challenges in monitoring and identifying root causes of issues led practitioners to bring the concept into the software realm. A very good 2017 article on Medium by Cindy Sridharan explains the differences between Monitoring and Observability. You can find a great deal more material on the subject of observability in software systems online now, two years on.
Sridharan observes (if I may borrow the word), “Building ‘monitorable’ systems requires being able to understand the failure domain of the critical components of the system proactively. And that’s a tall order. Especially for complex systems. More so for simple systems that interact complexly with one another.”
Charity Majors, another expert in the area of observability in scaled cloud environments, often pushes back against over-reliance on aggregated log data. Aggregated data can, at best, reflect problems the designers could anticipate. Scaled cloud environments are so dynamic that it isn’t reasonable to expect designers to anticipate all possible sources of issues in advance. Instead, what’s needed is a way to analyze raw data at the event level.
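As a rough sketch of what “raw data at the event level” can look like in practice (the helper and the field names are mine, not from any particular tool): emit one wide, self-describing event per unit of work as a JSON line, with no aggregation or formatting baked in.

```python
import json
import sys

def emit_event(stream=sys.stdout, **fields):
    """Write one self-describing event as a JSON line: raw material for
    later analysis, with no aggregation or formatting assumptions."""
    stream.write(json.dumps(fields, sort_keys=True) + "\n")

# One wide event per request, carrying every dimension we might
# conceivably want to query later.
emit_event(event="http_request", route="/checkout", status=502,
           duration_ms=1843, region="us-east-1", cart_items=7)
```

Because nothing is pre-aggregated, the questions we can ask afterward aren’t limited to the ones we thought of at design time.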
The reason this has become something to “push back” against is that tools for aggregating logs have proliferated. We all know that people love to latch onto tools as a first resort when they have problems, rather than stepping back and figuring out what the problems really are and then choosing tools accordingly, if they determine tools are needed at all. So now we have to fight that battle before we can guide people to the real battle. Sort of like having to deal with the Lannisters before we can turn our attention to the White Walkers.
Like everything else, observability has a downside as well as an upside. How are we mere mortals supposed to make sense of a flood of raw event-level data that isn’t organized or grouped in any particular way? The answer, as with most things worth doing, is: not without effort. The good news is that contemporary advances in data analytics and machine learning make it feasible to probe and explore live cloud-based systems in real time as well as via historical log data; a way to discover sources of issues and complicated combinations of conditions that lead to problems no one could have anticipated at design time. But it isn’t “free,” as calculated in the currency of “effort.”
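To give a feel for that effort, here’s a minimal, hypothetical analysis over raw JSON-line events: grouping slow requests by route and region to surface a pattern nobody thought to put on a dashboard. The event schema is invented for the example.

```python
import json
from collections import Counter

def slow_request_hotspots(lines, threshold_ms=1000):
    """Group raw events by (route, region) for requests slower than the
    threshold: the kind of ad-hoc question pre-aggregated metrics can't answer."""
    hot = Counter()
    for line in lines:
        event = json.loads(line)
        if event.get("event") == "http_request" and event.get("duration_ms", 0) > threshold_ms:
            hot[(event["route"], event["region"])] += 1
    return hot

raw = [
    '{"event": "http_request", "route": "/checkout", "region": "us-east-1", "duration_ms": 1843}',
    '{"event": "http_request", "route": "/checkout", "region": "eu-west-1", "duration_ms": 90}',
    '{"event": "http_request", "route": "/checkout", "region": "us-east-1", "duration_ms": 2100}',
]
# slow_request_hotspots(raw) -> Counter({("/checkout", "us-east-1"): 2})
```

Real tooling does this at vastly larger scale, but the principle is the same: keep the raw events, and derive the groupings after the fact.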
So, we might learn that our application architecture or design is the root cause of numerous unanticipated issues that manifest only when the application is deployed as part of a scaled, dynamic, cloud-based collection of related solutions. The problems manifest on a level we can’t possibly duplicate in a controlled test environment, or drive out through a set of example-based or property-based test cases.
Fortunately, we designed our application and its execution environment to be appropriately observable. Otherwise, we probably never would have learned this. Unfortunately, the only way to fix the problem is to replace the application entirely. Historically, replacing an application has been a fool’s errand. True, it’s an errand on which countless fools have embarked, and yet still a fool’s errand. “It’s only a re-write!” has been the suicidal battle-cry of many a project.
When we study software in school, we either learn Computer Science (how computers work, how operating systems and compilers work, data structures, fundamental algorithms etc.) or Management Information Systems (how to manage software projects and budgets, etc.). Then most of us go out into the world and work in application development and support; an occupation requiring a boatload of skills we never learned in school. That’s why nearly all software out in the world is hacked up garbage based on no engineering principles whatsoever. That includes virtually all the software embedded in IoT devices, which are currently proliferating rapidly. Interesting times ahead, indeed.
We need practical ways to make software easy and cheap to replace, either wholesale or one function/component at a time. Random design and development practices won’t get us there. If we can get software developers to follow a few basic principles when building applications, I think we can go a long way toward making our software easier to replace.
- Separate concerns. This is a basic principle of software design that is expressed in various ways depending on context. One of the oldest expressions of the idea is the concept of modular code. In the Object-Oriented world, the same idea is often expressed as the Single Responsibility Principle (SRP). Notice the wording in the cited definition of SRP: “one reason to change.” That’s key to designing replaceable software. Each component (by whatever name – module, class, package, assembly, what-have-you) does just one thing and the only reason to change that component is to change that single thing. And the other way around: If we want to change any single thing, we only have to worry about changing one component.
The scope of the “thing” depends on the scope of the component. A layer in the OSI Model does just one thing, but that thing is of much larger scope than the thing done by a single function in a language like Haskell or F#, or a single method in a language like Java or Ruby. Context matters. The idea of breaking applications up into small, distinct pieces, each of which does just one thing, is fundamental to microservices architectures and the serverless model. In principle, we can replace any single service in a microservice-based solution without affecting the other services. Of course, that depends on our applying sound software engineering principles in our work; you know…the stuff we didn’t learn in school.
There’s also the idea of cross-cutting concerns. It may not be practical to separate those completely. There was a fad a few years ago around Aspect-Oriented Programming (AOP). You can still do it, if you really insist. The reason it never caught on in a big way is that it’s more trouble than it’s worth. It isn’t that the concept is flawed. In any case, some of the usual cross-cutting concerns can be handled by the execution environment, the cloud infrastructure. Security comes to mind, although there are some security-related design points that ought to be handled by applications (details out of scope here). Logging is probably the main cross-cutting concern of interest in the context of scaled cloud-based applications. The 12 Factor model proposes keeping it simple within applications; just dump raw event data to stdout. Let a different facility deal with collecting and storing the data. Cloud infrastructures handle this. It isn’t a “concern” for applications, apart from logging event-level data, preferably in a way that isn’t tool-dependent and that makes no assumptions about aggregation or formatting.
- Trust the environment. This concept stems from two familiar ideas: avoiding premature optimization and favoring loose coupling. We want to avoid hand-optimizing code altogether, if we can. Let the execution environment handle issues like perceived response time and throughput, apart from “obvious” things like not coding a database call inside a loop. With applications chopped up into small pieces, the major component of perceived response time will be network latency rather than compute time. Optimizations such as reducing four lines of code into one by putting multiple expressions on the same line really aren’t improving anything. Write the business functionality in a language that’s easy to modify, and craft acceptance-level or story-level test cases that make it “easy” to re-implement the functionality in any other language at any time, “just” by making the test cases pass.
Let the cloud environment deal with dynamic changes in demand, network latency, and other matters (like logging, mentioned above). Applications need not include any built-in knowledge of those considerations. Ideally, applications should never know what environment they’re running in. For maximum replaceability, they must instead be as “vanilla” as possible in all respects; especially in their interactions with other components. To keep our applications as simple and replaceable as possible, we want to push as much operational detail out into the surrounding execution environment as we can.
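As a minimal sketch of both principles together — everything here, names and environment variables included, is hypothetical — consider a component with exactly one reason to change. It takes its configuration from the environment and emits raw event data on stdout, and knows nothing else about where it runs:

```python
import json
import os
import sys

# One component, one reason to change: converting an amount between
# currencies. Configuration arrives via environment variables, and the
# only output is plain event data on stdout; the component has no other
# knowledge of its execution environment.

def convert(amount_cents, rate):
    """The single thing this module does."""
    return round(amount_cents * rate)

def main(environ=os.environ, out=sys.stdout):
    rate = float(environ.get("EXCHANGE_RATE", "1.0"))   # config from environment
    amount = int(environ.get("AMOUNT_CENTS", "0"))
    result = convert(amount, rate)
    # Raw event to stdout; collection, storage, and formatting are the
    # surrounding infrastructure's concern, not this component's.
    out.write(json.dumps({"event": "converted", "cents": result}) + "\n")
    return result

if __name__ == "__main__":
    main()
```

Because the component is this “vanilla,” replacing it — even in a different language — means reproducing one small behavior and honoring the same environment variables and output format, nothing more.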
Like the other two system qualities, replaceability has its downsides, too. The most obvious is that as we make the pieces smaller and smaller, we increase the amount of “glue” in between the pieces to make the whole system work. We’re not reducing complexity, we’re just moving it around. We have to choose our poison. In this case, I’m thinking we should choose the poison of coordinating a lot of small pieces over the poison of writing complicated, monolithic chunks of code. The objective is to make it feasible to replace all or parts of an application, as we learn more about its operational characteristics through analyzing it in production.
Another thing that could be different here, depending on your habitual mindset about applications: We need to see our code as a disposable, temporary, or tactical asset rather than as a long-lived, strategic asset. The implication is we want to emphasize replaceability at the expense of maintainability, should we discover any conflicts between those two concepts. In general, it means we don’t try to build in “hooks” to account for possible future changes; instead, we build in replaceability so that we can quickly replace the old application with a new implementation; no hooks needed.
In principle (and those are magic words, to an extent), if we make our components small and single-purposed, and we keep them simple and ignorant of their execution environment, we ought to be able to replace any one of them with an hour or two of development effort, including all necessary testing and validation.
For decades, we’ve attempted to perfect our software. We’ve improved design and coding methods, testing methods, and monitoring and recovery methods. But as technology has advanced over the years, the possible imperfections grow faster than our ability to prevent or cope with them. The situation has reached the point that we need to reconsider our goals for designing, testing, and monitoring software. I propose we emphasize resiliency, observability, and replaceability over traditional areas of focus such as maintainability and hand-optimization.