How "let it fail" leads to simpler code

The BEAM virtual machine (the VM that runs our Elixir and Erlang code) is famous for being fault-tolerant.
Many stories have been told about reliable applications written in Erlang.
And people tend to attribute their high reliability to the language's fault-tolerant nature.
But I think this attribution is only half-true.
Fault-tolerance definitely helps, but it's just a nice starting point.
Where fault-tolerance really shines is that
we can let it fail as soon as the code is in an unexpected state,
we can write much less error-handling code,
and we can focus much more on our business logic.
With much simpler code, we can maintain our applications much more easily, raising our chances of making these applications more reliable and keeping them running in the long run.
This post explains why that's the case and how to leverage a fault-tolerant language to write simpler code.

Distinguish expected and unexpected errors

The biggest lesson here is not to use Supervisors in every way possible and sweep error-handling code under the rug,
but to really think about what error-handling code to write and what not to write at all.
To do that, we need to really understand and distinguish different types of "errors" in our system.

In short, I would put "errors" into 2 categories:

  • Expected errors

    Expected errors are predictable situations that are slightly off the happy path.
    When these errors occur, the user (or the client code) can fix them without our help.
    Invalid parameters or inputs are the best examples.

  • Unexpected errors

    Unexpected errors are unpredictable situations that the user cannot fix.
    Most programming languages raise an exception for unexpected errors.
    When these errors occur, our code cannot continue running at all,
    and the responsibility to fix them falls back to the developers or operators.
    Running out of disk space and network failures are the best examples.

Error is a simple word but has too many meanings.
To the user, a "username is taken" error is about the same as a "500 Internal Server Error" page.
The user doesn't know whether an error is expected or unexpected when she sees an error page or flash message.
To the developer, a function call may return an error tuple and another function may raise an exception.
But the error tuple might be the unexpected case, while the exception might be the expected one.
So it's easy to mix expected and unexpected errors unless we as developers really think about that when writing our code.
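
A minimal sketch of how the form of an error and its category can disagree. The `Boundary` module, its function names, and the `:port` setting are hypothetical; only `String.to_integer/1` and `Map.fetch/2` are real library calls.

```elixir
defmodule Boundary do
  # An exception that is *expected*: user-typed text may not be a number,
  # so the ArgumentError is turned into an error tuple the caller can act on.
  def parse_age(input) do
    {:ok, String.to_integer(input)}
  rescue
    ArgumentError -> {:error, :not_a_number}
  end

  # An error value that is *unexpected*: we assume our own code built the
  # settings map, so a missing :port is a bug. Matching on {:ok, _} raises
  # a MatchError instead of continuing in an undefined state.
  def port!(settings) do
    {:ok, port} = Map.fetch(settings, :port)
    port
  end
end
```

Here `Boundary.parse_age("forty")` returns `{:error, :not_a_number}`, while `Boundary.port!(%{})` raises a `MatchError`.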

The Erlang paper (Making reliable distributed systems in the presence of software errors) gives a great answer to the question of "what is an error?":

It is the programmer who decides if an exception corresponds to an error...

Schneider gives a number of definitions of fault-tolerance in his 1990 ACM tutorial paper.
In this paper he says:
A component is considered faulty once its behaviour is no longer consistent with its specification

In the end it all comes down to the specifications (i.e. expected or unexpected).
If the spec has defined the response under a certain abnormal situation, then it's an expected error.
And if the code finds itself in an undefined state, then it's better to fail early.

Here are some examples to help you tell expected from unexpected when you think about errors:

  1. Divide by zero error
    • When it's expected

      When you are building a calculator, it's expected that a user may type in 1 / 0.
      So the divide-by-zero error should be considered in this case.
      The developer should think about what to return for these possible inputs.

    • When it's unexpected

      When you really pass 0 as the divisor, the runtime doesn't know how to proceed.
      The behaviour is undefined mathematically, period.
      So it's reasonable to raise an exception.

  2. File does not exist error
    • When it's expected

      When you are building a text editor, it's expected that a user may open a non-existing file to create it.
      The text editor may not even treat it as an error, but a daily operation.

    • When it's unexpected

      When your app needs to read a config file during its startup process, it's often better to treat the missing config file as an unexpected error.
      A common solution in this case is to fall back to some default values,
      but I find that too confusing at both the operations level and the code level.
      It's hard to debug when the config file is missing, but the app still works (but differently).
      It's hard to understand the code because the fallback may happen at any level.
      So I would always assume the config file is there when the app boots, and raise if it's not the case.
      (I learned this idea from Chris Keathley's talk "Building Resilient Systems with Stacking" at ElixirConf EU 2019.)
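
The two examples above could look like this in code. This is only a sketch: `Calculator`, `AppConfig`, and the shape of their return values are assumptions made for illustration.

```elixir
defmodule Calculator do
  # Expected: the user may type "1 / 0", so the spec defines an answer.
  # The `== 0` guard covers both the integer 0 and the float 0.0.
  def divide(_dividend, divisor) when divisor == 0, do: {:error, :division_by_zero}
  def divide(dividend, divisor), do: {:ok, dividend / divisor}
end

defmodule AppConfig do
  # Unexpected: the config file is assumed to exist at boot time.
  # File.read!/1 raises File.Error if it does not, so the app refuses
  # to start instead of silently running on default values.
  def load!(path), do: File.read!(path)
end
```

`Calculator.divide(1, 0)` returns `{:error, :division_by_zero}` for the user interface to display, while `AppConfig.load!/1` with a missing path raises and stops the boot.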

Write less code by "ignoring" unexpected errors

Now that we've distinguished expected and unexpected errors,
how would this help us write less code?

The trick is that we can now "ignore" unexpected errors once and for all.
Let the fault-tolerant runtime handle it.

First, let's admit it, we cannot predict or handle all the unexpected errors.
We want our nodes to be always online, but ultimately, it just takes a short power outage or network outage to take down one of our nodes.
We can't handle these kinds of hardware errors at the software level.
So we don't try to handle them; instead, we detect them via links and supervisors.

The same logic applies to all the unexpected errors.
They are unexpected according to the specification, so we don't know how to handle them.
So let's stop proceeding and fall back to an easier task.

The easier task might be:

  • log the unexpected error
  • retry from scratch (i.e. restart the process in BEAM)
  • stop the application
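
To make the three fallbacks concrete, here is a hand-rolled sketch that runs a function in a monitored process, logs a crash, retries from scratch, and finally gives up. The `EasierTask` module and its retry limit are hypothetical; in a real application an OTP supervisor does this work for us.

```elixir
defmodule EasierTask do
  require Logger

  # Runs `fun` in a separate, monitored process so a crash cannot take the
  # caller down. On a crash: log it, then retry from scratch. Once
  # `retries_left` reaches zero, stop and report the failure.
  def run(fun, retries_left \\ 2) do
    parent = self()
    tag = make_ref()
    {pid, monitor} = spawn_monitor(fn -> send(parent, {tag, fun.()}) end)

    receive do
      {^tag, result} ->
        # Success: drop the monitor and any pending :DOWN message.
        Process.demonitor(monitor, [:flush])
        {:ok, result}

      {:DOWN, ^monitor, :process, ^pid, reason} when retries_left > 0 ->
        Logger.error("task crashed, retrying: #{inspect(reason)}")
        run(fun, retries_left - 1)

      {:DOWN, ^monitor, :process, ^pid, reason} ->
        {:error, reason}
    end
  end
end
```

Note that `fun` itself contains no error handling at all: it either returns a result or crashes, and the decision about what to do next lives in one place.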

Write more confident code by asserting our assumptions

With the privilege of "ignoring" unexpected errors, we can write less, but more confident, code,
because we can build a fence around our code: an assertion shell that checks our assumptions.

  • If a function only accepts a non-empty list, pattern match that in the function header:

    def average([_ | _] = list) do
      ...
    end
    
  • If a result is always expected from a list, pattern match that and ensure it's not nil:

    %SomeStruct{} = result = Enum.find(list, ...)
    
  • If a case statement only expects some possible inputs, there is no need to add a catch-all clause at the end:

    case File.read(path) do
      {:ok, binary} ->
        ...
    
      {:error, :enoent} ->
        {:error, :file_does_not_exist}
    
      # the other errors are unexpected and we should abort when they appear
    end
    
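The three assertions above, filled in as one runnable sketch. The module names, the `User` struct, and the function bodies are assumptions added for illustration; the original snippets leave them elided.

```elixir
defmodule User do
  defstruct [:name]
end

defmodule AssertiveShell do
  # An empty list matches no clause, so average([]) raises FunctionClauseError.
  def average([_ | _] = list), do: Enum.sum(list) / length(list)

  # Enum.find/2 returns nil when nothing matches; the struct pattern turns
  # that nil into a MatchError right here instead of somewhere downstream.
  def find_user!(users, name) do
    %User{} = user = Enum.find(users, &(&1.name == name))
    user
  end

  # Only the two cases the spec defines are handled. Any other result,
  # such as {:error, :eisdir}, raises CaseClauseError.
  def read(path) do
    case File.read(path) do
      {:ok, binary} -> {:ok, binary}
      {:error, :enoent} -> {:error, :file_does_not_exist}
    end
  end
end
```

Each failure carries the offending value in its exception, so the crash report points directly at the assumption that was broken.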

By writing code in this style, we are basically distilling the input of our functions[1], making it purer and purer as it travels deeper into our core logic.
I call this style of coding "Assertive Shell, Confident Core".
(Inspired by Gary Bernhardt's talk "Boundaries" from RubyConf 12.)
An Erlang veteran may simply call it "Let it fail".

And this coding style is also very MVP-ish and extensible.
When we are just starting an application, the specification is usually vague and leaves many cases undefined.
So instead of spending too much effort trying to cover all the cases we can think of, we just ensure we are always on our designed happy path, and raise an exception if that's no longer the case.
As the spec becomes more and more sophisticated, our code can grow accordingly and handle more cases according to the spec.

BEAM VM makes "let it fail" production-ready

We can apply this coding style in any programming language.
So what's special about the BEAM and its fault tolerance here?
The answer is isolation.

Processes are isolated from one another in the BEAM VM.
So it's okay if a process dies due to an exception.
Its supervisor will bring it back in a clean state.
This mechanism is baked into the language.
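
A small sketch of that mechanism, assuming a hypothetical `Counter` GenServer and `Demo` module. The counter raises when it receives a `:corrupt` message, and a `:one_for_one` supervisor restarts it with its initial state.

```elixir
defmodule Counter do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, 0, name: __MODULE__)
  def increment, do: GenServer.call(__MODULE__, :increment)
  def value, do: GenServer.call(__MODULE__, :value)

  @impl true
  def init(count), do: {:ok, count}

  @impl true
  def handle_call(:increment, _from, count), do: {:reply, count + 1, count + 1}
  def handle_call(:value, _from, count), do: {:reply, count, count}

  # Simulates an unexpected state: the process raises and dies.
  @impl true
  def handle_cast(:corrupt, _count), do: raise("unexpected state")
end

defmodule Demo do
  def run do
    {:ok, _supervisor} = Supervisor.start_link([Counter], strategy: :one_for_one)
    1 = Counter.increment()

    old_pid = Process.whereis(Counter)
    GenServer.cast(Counter, :corrupt)
    new_pid = await_restart(old_pid)

    # A different process now answers under the same name, starting from 0.
    {old_pid != new_pid, Counter.value()}
  end

  # Polls until the supervisor has registered a replacement process.
  defp await_restart(old_pid) do
    case Process.whereis(Counter) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ ->
        Process.sleep(10)
        await_restart(old_pid)
    end
  end
end
```

The crash is logged, but the caller of `Demo.run/0` is never affected by it: the failure stays inside the process that hit the unexpected state.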

In other languages like Ruby, I'm not so confident about raising an exception in an arbitrary function.
I'm afraid that this exception would bring the whole application down.
I can only rely on the framework (Rails, Sidekiq, etc.) to catch this exception and handle it magically.

I guess this is why this coding style is not so popular after all.
The BEAM VM offers this privilege to us.

Summary

"Let it fail" and fault tolerance are clichéd topics in the BEAM community.
I was hesitant to write this article,
as I always told myself the idea had already been explained so well, as the Erlang paper (Making reliable distributed systems in the presence of software errors) did.

I'm happy with the final piece though,
because I think we haven't put enough emphasis on the simplicity and confidence the "Let it fail" coding style brings us.
Remember to define your specifications more thoroughly,
and enjoy ignoring the unexpected errors!

Footnotes:

1

If we want to go to the extreme here, we can even check that a function finishes within a certain time limit (:timer.tc reports the elapsed time in microseconds):

case :timer.tc(fn -> ... end) do
  {time, result} when time < 10_000 ->
    result
end