Introduction to the Fail fast! Principle in Software Development

Christian Neumanns

2016-10-25

Abstract

This article introduces the Fail fast! principle. What is it? When should we use it? How does it help us to write better code?

Whenever an error occurs in a running software application there are typically three possible error-handling approaches:

The Ignore! approach: the error is ignored and the application continues execution
The Fail fast! approach: the application stops immediately and reports an error
The Fail safe! approach: the application acknowledges the error and continues execution in the best possible way
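As a minimal sketch, the three approaches could look as follows in Python when a text value has to be converted to an integer. The function names, the fallback value and the log list are hypothetical and only serve as illustration.

```python
def parse_ignore(text):
    """Ignore!: swallow the error and continue with an arbitrary value."""
    try:
        return int(text)
    except ValueError:
        return 0  # the caller cannot tell that anything went wrong


def parse_fail_fast(text):
    """Fail fast!: stop immediately and report the error."""
    return int(text)  # int() raises ValueError on invalid input


def parse_fail_safe(text, default, log):
    """Fail safe!: acknowledge the error, then continue with a safe fallback."""
    try:
        return int(text)
    except ValueError as error:
        log.append(f"invalid value {text!r}, using default {default}: {error}")
        return default
```

With the input "4o" (a letter o instead of a zero), the first function silently returns 0, the second raises an exception, and the third records the problem and returns the default chosen by the caller.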

Which approach is the best one?

Which one should you apply in your application?

Before answering this vital question, let us first look at a simple example.

Suppose we have to write a (rudimentary) web application that displays a warning message near a water fountain to warn people that the water is polluted.

The following HTML code does the job:

<html>
   <body>
      <h2 style="color:red;">Important!</h2>
      <p>Please <b>DO NOT</b> drink this water!</p>
   </body>
</html>

The result displayed in the browser looks like this:

Now let us insert a small bug. Instead of </b> we write <b> after DO NOT, as shown below:

<p>Please <b>DO NOT<b> drink this water!</p>

Two interesting questions arise:

What should happen?
What will happen?

The second question is easy to answer. We just have to feed the buggy HTML code to our browser. This is the result - as displayed in Chrome, Edge, Firefox, Internet Explorer, and Safari at the time of writing:

Before reading on, ask yourself: "Which approach has been applied by the browsers?" ...

Obviously, the Fail fast! approach has not been applied, because the application continued and did not report an error. The only visible difference is that more text is now displayed in bold. The message as a whole is still displayed correctly, and people are warned. Nothing to worry about too much!

Let’s try another bug. Instead of <b> we write <b before DO NOT, as shown below:

<p>Please <b DO NOT</b> drink this water!</p>

This is the result - again as displayed in the browsers mentioned before:

Panic! The words DO NOT are no longer displayed, so the program now does exactly the opposite of what it is supposed to do. The consequences are terrible. Our life-saving application has mutated into a killer-application (but not the kind of killer-application we all dream of writing one day).

It is important to be aware that the above example is not just a theoretical, exaggerated one. There are a good number of real-life cases of ‘little bugs’ with catastrophic consequences, such as the Mariner 1 spacecraft, which had to be destroyed shortly after lift-off, reportedly because of a ‘missing hyphen’. For more examples, see: List of software bugs.

As we can see from the above example, the consequences of not applying the Fail fast! approach vary widely and can range from completely harmless to extremely harmful.

Note

Unless we look at the browsers' source code, we don't know whether the Ignore! or Fail safe! approach is applied. My guess is that the tokens DO and NOT in the HTML code are interpreted as attributes of tag b, but without values (such as DO="foo"), and without being part of standard HTML. Therefore they are ignored. The closing </b> is probably also simply ignored, because "drink this water" is displayed in bold.

So, what is the correct answer to the important question "What should happen?"

Well, it depends on the situation. There are, however, some general rules.

The first rule is:

We should never "Ignore!" an error - unless there is a really good reason to do so.

This rule is well known and doesn't need any further explanation.

Remember rule 6 of The Ten Commandments for C Programmers, eloquently written in archaic English by Henry Spencer:

"If a function be advertised to return an error code in the event of difficulties, thou shalt check for that code, yea, even though the checks triple the size of thy code and produce aches in thy typing fingers, for if thou thinkest 'it cannot happen to me', the gods shall surely punish thee for thy arrogance."
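The commandment applies outside C as well. As a small illustration in Python: `str.find` returns -1 as an error code when the substring is not found. The two function names below are hypothetical; the first ignores the code, the second checks it.

```python
def port_ignoring_errors(address):
    """Ignore!: use the result of find() without checking it."""
    colon = address.find(":")    # returns -1 if there is no colon
    return address[colon + 1:]   # with -1 this silently returns the whole string


def port_checked(address):
    """Check the error code and fail fast when it signals a problem."""
    colon = address.find(":")
    if colon == -1:
        raise ValueError(f"missing ':' in address {address!r}")
    return address[colon + 1:]
```

For the input "localhost", the unchecked version returns "localhost" as if it were a port, and the mistake surfaces somewhere else later. The checked version reports the problem at the place where it occurs.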

The second rule is:

During development we should apply the Fail fast! approach.

The rationale behind this rule is easy to understand:

The Fail fast! approach helps in debugging.
As soon as something goes wrong, the application stops and the error message helps to detect, diagnose and correct the error. Therefore the Fail fast! approach leads to more reliable software, reduces development and maintenance costs, and prevents frustrations and catastrophes that might otherwise occur in production mode.
Even if a bug doesn't lead to a severe failure, it is always best to detect it as soon as possible, because the cost of fixing a bug rises steeply the later it is found in the development cycle (compile-time, test-time, production-time).
The consequences of bugs appearing during development mode are generally not harmful.
The customer doesn’t complain, money doesn’t go to the wrong account, and rockets don’t explode.

Failing fast is commonly considered good practice in software development. Here are a few supporting quotes:

"Encourage good code habits ... This has many corollaries, including 'fail fast' ..."
The Google Guava team - Philosophy Explained
"... 'failing immediately and visibly' sounds like it would make your software more fragile, but it actually makes it more robust. Bugs are easier to find and fix, so fewer go into production."
Jim Shore / Martin Fowler - Fail Fast
"Some of the hardest bugs to track down have (in part) been caused by code that silently fails and continues instead of throwing an error. ... It is better to return an error as soon as a failure case is detected."
Henrik Warne - 18 Lessons From 13 Years of Tricky Bugs
"We don’t wait long periods of time before learning that something isn’t working. We fail fast ..."
Joshua Kerievsky - An Introduction to Modern Agile

However, the situation can change radically when the application runs in production mode. Unfortunately, there is no one-size-fits-all rule. Practice shows that it is generally better to apply the Fail fast! approach by default here, too. The final damage caused by an application that ignores an error and just continues arbitrarily is generally worse than the damage caused by an application that stops suddenly. For example, if an accounting application suddenly stops working, the user is angry. But if it silently ignores an error, continues, and produces wrong results (such as an unbalanced balance sheet), the user is very angry. ‘Angry’ is better than ‘very angry’. Therefore, in this case the Fail fast! approach is better.

In our previous HTML example, the Fail fast! approach would also be much better. Suppose that, instead of continuing execution, the browsers displayed an error message. The developer(s) would then immediately become aware of the problem, and the code could be fixed quickly and easily, without causing any harm. And even if the buggy code went into production (for strange reasons), the worst-case scenario would be less terrible. Displaying "Please drink this water" can be dreadful. On the other hand, not displaying any message, or just displaying an (incomprehensible) error message, would probably just result in a very low percentage of people daring to taste a small quantity of water.

In practice, some cases must be studied individually and carefully. This is especially true if the greatest possible damage is high, such as in medical applications, money transfer applications or space invader applications. For example, applying the Fail fast! rule is obviously the right approach as long as a rocket to Mars has not taken off. But as soon as the rocket has lifted off, stopping the application (or, even worse, ignoring an error) is no longer an option. Now the Fail safe! approach must be applied in order to do the best we can.

A good option is sometimes to fail fast but minimize the damage. For example, if a run-time error occurs in a text editor application, the application should first automatically save the current text in a temporary file, then display a meaningful message to the user ("Sorry, ... but your current text is saved in a temporary file abc.tmp"), optionally send an error report to the developers, and then stop.
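The text editor scenario could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names, the `notify` callback (standing in for whatever the editor uses to show messages) and the message wording are hypothetical, and the optional error report to the developers is left out.

```python
import os
import tempfile


def save_emergency_copy(text):
    """Write the unsaved text to a new temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".tmp", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path


def run_editor_step(text, step, notify=print):
    """Run one editor operation; on an unexpected error, save the text, then stop."""
    try:
        return step(text)
    except Exception:
        path = save_emergency_copy(text)
        notify("Sorry, an internal error occurred, but your current text "
               f"is saved in the temporary file {path}")
        raise  # still fail fast: re-raise so that the application stops
```

The important detail is the final `raise`: the damage is limited first, but the error is not swallowed, so the application still stops and the bug remains visible.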

Hence, the third rule:

In critical applications the Fail safe! approach must be implemented in order to minimize damage.

To summarize:

In development mode we should always apply the Fail fast! approach.
In production mode:
- We should generally favor the Fail fast! approach by default.
- Critical applications that risk causing severe damage in case of a malfunction need customized, context-specific and damage-eliminating (or at least damage-reducing) behavior. In fault-tolerant systems, the Fail safe! approach must be applied, with a reaction appropriate to each error.

The same idea is expressed by the excellent Rule of Repair in The Art of Unix Programming, written by Eric Steven Raymond:

"Repair what you can - but when you must fail, fail noisily and as soon as possible."

Note

Further information and examples are available in Wikipedia under Fail fast, Fail safe, and Fault tolerant computer system.

In any case, it is always helpful to use a development environment that supports the Fail fast! principle. For example, a compiled language supports the Fail fast! rule because compilers can immediately report a whole plethora of bugs. Here is an example of a stupid bug that easily escapes the human eye and can lead to 'unwanted surprises', such as a hanging system due to an infinite loop:

var row_index = 1
...
row_indx = row_index + 1

Typos like this (i.e. writing row_indx instead of row_index) are common and are immediately caught by any decent compiler or (even better) by an intelligent IDE.
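In a dynamically typed language without a compile step, the same typo may go unnoticed. In Python, for instance, the misspelled assignment would simply create a new variable. One way to at least fail early at run-time is to keep the state in an object whose assignable attributes are fixed with `__slots__`. The following is a minimal sketch; the class name `Cursor` is hypothetical.

```python
class Cursor:
    """Holds loop state; __slots__ restricts which attributes may be assigned."""
    __slots__ = ("row_index",)

    def __init__(self):
        self.row_index = 1


cursor = Cursor()
cursor.row_index = cursor.row_index + 1       # correct name: works

try:
    cursor.row_indx = cursor.row_index + 1    # typo: there is no such slot
except AttributeError as error:
    print(f"caught early: {error}")
```

The typo now raises an AttributeError at the faulty line instead of leaving row_index unchanged and the loop hanging.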

Luckily there are a good number of very effective Fail fast! features that can be natively built into a programming language. They all rely on the following rule:

Errors should preferably be automatically detected at compile-time, or else as early as possible at run-time.

Examples of powerful Fail fast! language features are: static and semantic typing, compile-time null-safety (no null pointer errors at run-time!), design by contract, generic type parameters, integrated unit testing, etc.
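Python has no built-in design by contract, but the idea can be approximated with explicit checks. The sketch below is an assumption-laden illustration, not a standard API: the function `withdraw` and its rules are hypothetical.

```python
def withdraw(balance, amount):
    """Return the new balance after withdrawing amount from balance."""
    # Preconditions: reject invalid input before doing any work.
    if amount <= 0:
        raise ValueError(f"amount must be positive, got {amount}")
    if amount > balance:
        raise ValueError(f"insufficient funds: balance {balance}, amount {amount}")

    new_balance = balance - amount

    # Postcondition: a violation here would indicate a bug in this function.
    assert 0 <= new_balance < balance, "postcondition violated"
    return new_balance
```

Note that Python removes `assert` statements when it is run with the -O option, so checks that must also hold in production are better written as explicit `raise` statements, as done for the preconditions here.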

Even better than detecting errors early is to not allow them by design. This can be achieved if the programming language doesn't support error-prone programming techniques such as global mutable data, implicit type conversions, silently ignored arithmetic overflow errors, truthiness (e.g. "", 0 and null are treated as false) etc.
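Python is one of the languages that support truthiness, so it can be used to show why the technique is error-prone. In this minimal sketch (both function names are hypothetical), `None` stands for "value not known", while 0 is a perfectly valid count.

```python
def describe_truthy(count):
    """Relies on truthiness: 0 is treated like a missing value."""
    if not count:
        return "unknown"
    return f"{count} items"


def describe_explicit(count):
    """Distinguishes a missing value (None) from a legitimate 0."""
    if count is None:
        return "unknown"
    return f"{count} items"
```

For a count of 0, the first function answers "unknown" and the second answers "0 items". Nothing fails in the first version; it just silently gives the wrong answer, which is exactly the kind of error a stricter language would rule out.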

Therefore, we should always prefer a programming environment (= programming language + libraries + frameworks + tools) that supports the Fail fast! principle. We will debug less and we will produce more reliable and safe code in less time.