Archive for June, 2016

One of the nice things about taking months or even years to finish reading a book is that it gives me plenty of time to reflect on it. Sure, I could binge read, but the material is far less sticky. The longer I spend with a book, the more connections I can make. “Failure Is Not An Option,” which I just completed a few days ago, is a good example of this.

The book itself is a retelling of history of the manned space program from the point of view of someone intimately familiar with the events. At first blush, it would appear to have nothing to do with the process of software development. The Apple Pencil I’m using at the moment probably has more computing power than anything available at the time. So, what is the connection?

It is absolutely true that when dealing with matters of human life that failure is not an option. When you look at the history of American manned space flight from the outside, the only thing you see is the ultimate expression of this. You see successful execution, even in the case of unforeseen situations. And that’s the key, it’s from the outside. Success was not achieved because perfect people perfectly executed perfect plans using perfect equipment under perfect conditions. Yes, there were times when the execution, equipment or conditions were perfect. Most of the time, success was ensured inspite of less than perfect conditions. For me that’s the story.

During the space program, failure was a mandate. When you exist in a world of constrained resources, the best way to ensure that you will be successful is to see how you respond when the resources that you depend upon are available. But, wouldn’t you always being everything you needed and have it on hand? One would hope so, but let’s say, for the sake of argument, that you didn’t. What then? If something bad happens, how much time do you have to get things back on track before it’s game over? Well, that depends on what happened, when it happened and how important it was.

Okay, sure, fine, but what did they do and what does it have to do with software development?

In the case of the space program, failure was ensured before the mission. Every possible they could conceive of. Teams of people had the sole job of enducing failures during mission simulations. We’re not talking SimCity or Second Life here. These simulations were conducted in physical hardware as identical as possible. The difference is that this user interface was driven not by physical sensors, but computers. Crews were made to experience failure over and over until their responses were second nature. When you think about it, this is no different from any physical endeavor. The more you train yourself to respond to situational changes, the better your performance becomes.

As of late, I believe that many of the problems with software and hardware have been coming about because of a focus merely on the “happy path.” The idea that we should worry first about making it work and then later about failure conditions. Unfortunately, once management sees something that “works,” they  are unlikely to allocate time to break things. The situation has been made worse by the attitude that software will be tested in beta.

As a result, not only is inferior quality software being produced, but the software is being used in environments which expose their data to exfiltration. We see this is malware infecting point-of-sale systems. We see it in OpenSSL and other open source code. Software being widely adopted under the assumption that someone else must be testing it. Entire generations argue for speed over safety.

So, why has this attitude taken hold? Is it harder to design for failure first? Not really. In fact, it makes unit testing easier as it builds in the failure paths before the actual implementation. But doesn’t this make the code slower? Not implicitly. Between the use of modern tool chains and profile guided optimization, the cost of fail first, fail fast is minimal. Why don’t we do privilege segregation? For the same reason we don’t apply principles of  MVC, MVVM or VIPER. People sit in front of a screen and type. They use the excuse of doing “agile” development on short cycles for not properly planning. Now, I have nothing against short cycles, but to be used properly, you have to accept that not every problem can be evaluated, solved, implemented and tested in 2 or 3 weeks. Some things are hard. Some things can only be accomplished by specific individuals. Some tasks have dependencies on outside resources. Software isn’t building IKEA furniture. There are high level of indeterminacy that can and do crop up.

Neither can we pretend that the software we create lives in isolation or that security is the responsibility of the user. Holding onto data simply because it would make the developer’s job easier during development is not enough of a reason. Not using OS provided encrypted data storage because it makes it harder to debug is simply lame. Transmitting data in the clear or without access control just invites both data exfiltration and command-and-control injection.

Here’s the thing. If you want to succeed, you must fail. You must fail first and you must fail fast. Failure helps to characterize the system. It helps in the creation of documentation. It helps validate the design. Failure makes gives you a better understanding of the problem space. In short, failure is not an option; it is a requirement.

Read Full Post »

%d bloggers like this: