Circuit Breakers

Every building today has one, you’ve probably already seen one. It is essential for your security and to prevent your electronics don’t break.

Circuit breakers are designed to prevent your electrical circuit from damage caused by too much current flowing through it, it basically switches automatically to interrupt the current flow until somebody can fit it.

It is famous at wall street too, and can be called “trading curb”, it’s used to prevent dramatic losses and speculative gains, when the market fall or rises a lot in a small data frame, it opens and stops trading.

Circuit breakers are a common pattern in distributed systems too, it was described by Michael T. Nygard in his famous book Release It.

Today most of the computer systems can be considered distributed systems, or at least make one request to external services, the way of communicating can vary but 90% of our systems are exchanging information across the internet.

In a constant seek for more and more reliable and stable systems, engineers need to be careful about everything that can go wrong, the internet isn’t 100% trusty and when you’re making requests between two points using it, you need to keep in mind that a lot of things can go wrong, as we learned with the famous “Fallacies of distributed computing” for example, and that’s why integration system can be considered an antipattern of stability .

In this context, today, circuit breakers became a popular choice to handle HTTP errors, but it can be used to handle critical operations too.

What will occur with your system if another system or operation starts to fail? You’ll start to have a cascading failure “A cascading failure occurs when an error in one system affects others, with the initial failure walking down into your systems layer causing other errors.”.

Cascading Failure spanding across services.

The algorithm is simple and short, it has two main states “open” and “closed”, and we start it closed (the normal state) when the circuit is closed everything is working as expected, if one error occurs during the execution of the handled operation the circuit starts to track the number of errors, if the number the errors at a certain time crosses the threshold it will open.

The second main state is “open”, the operation will not be executed because a sequence of N errors in X time has made the circuit open.

After some time (predefined too) on the open state, the circuit changes to a “transition state” called half-open. When it is in this state, the circuit executes the next call to the handled operation, if the execution success the circuit goes to closed state again, if the execution fails it back to open state and wait for the next change to half-open.

Obviously in some cases only one successful request isn’t good to switch the state to closed again, and it can be configured too.

Technical Details

Exceptions or response structures

If you search on the internet, you’ll find a lot of different ways to implement circuit breakers, some people following the suggestion of the Release It! book and other ways.

The first thing that you need to pay attention is how you’ll know if the operation successes or failed, you have two great options here, you can control it by exceptions or by the response.

When handling with exceptions you have the advantage that it’s easy to start using circuit breaker because exceptions are already present in a lot of languages, in case of your own operations you only need to raise an exception when something goes wrong, when dealing with 3rd party libs, almost all will raise exceptions. Particularly I don’t like this approach, I know that some people may like to control flow based on exception but I don’t consider it a good way, and it’s important to remember that some languages like rust doesn’t have exceptions.

You can know if your operation has failed or not based on its response, it looks much more smoothie, readable, and simple. But what’s the operation needs to return as its “response”? The operation can return any structure that contains information if the operation successes, for example, if you are working with an object-oriented language you only need to return any object that responds to a message like success?, it will be much more “object-oriented”.

class Response
  def success?
    ...
  end
end

This approach of return response structures has become popular in the last years, “hyped” languages encourage it, for example, go lang have its built-in error the type that is an interface, it’s common to an operation return it’s the result and an error (if any), for example:

f, err := os.Open("filename.ext")
if err != nil {
    log.Fatal(err)
}

Rust has it’s result type too, called Result<T, E>, if you’re dealing with functional languages, in Elixir is common to return a tuple with the first item being the return status of the function that you can pattern match against:

{:ok, %{"age" => 22, "name" => "Otavio Valadares"}}

For these reasons, I think the best way to handle your responses when dealing with circuit breakers (and any operation) maybe with response structures.

Track errors on 3rd party systems or memory

Another thing that you need to deal with is how you’ll handle the error tracking, the first obvious option is to trust in 3rd party service like a redis database, you’ll only need a key to increment with TTL. But with this approach you’ll create a shared resource, that is an antipattern of stability too, if your redis goes down your entire application will go down too? You’ll deal with the famous “Quis custodiet ipsos custodes?“ because your connection with redis will be not faulted tolerant.

The second option is to save this error tracking in memory, you have some kind of structure that stores the count and the timing, but when working with an application handling a lot of operations you can have memory problems storing this structures in memory, I know that for 95% of the cases this is not a problem today, but if you’re dealing with low memory applications it may be a problem.

In some cases, you may want to use the Redis way for some reason, in this case, I recommend using a memory circuit breaker to watch the redis connection, in other cases I think in memory track the best way to solve this problem.

Observability

Observability is another important thing when working with a circuit breaker too, if your circuit opens, it can save your application for some errors, but the true magic stands when this information can be used by your stakeholders and by other applications to change its behavior automatically, for example, if your circuit that handles your 3rd payment partner closes you can hide your payment tab in the mobile app.

You need to provide the status of your circuits (or a group of them, like, all circuit that handles some 3rd partner or critical operation when saving an invoice) somewhere, it can be simple as providing it in your healthcheck route, but when providing it in an endpoint it can lead to some problem if another application needs to check your circuit breaker status every time that will do an operation.

A good approach is to put a notification message at your favorite message broken, if an application is interested in this message it reads and takes its own decisions. Another solution that can be considered is building a “circuit breaker control pane” an application that knows the state of all circuit breakers of your company, but it will lead you to a great bus factor issue.

It’s important to show the circuit breaker switching is states and actual states for human too, and will be good to put it in your Grafana, in some cases a bot that notifies your Slack channel, and integrates it with your alarm system like OpsGenie.

Service Mesh

Code a circuit breaker logic for all applications can be frustrating and even if you’re using a library it can be boring to install it and set up in every application, thinking about that all that boring stuff about repeated logic in the application level, service mesh was created and one thing that almost all service mesh system has in its sidecars is the circuit breaker, if you’re using any service mesh, you don’t need to code it at the application level.

But I know that service mesh technologies is not a reality for a lot of companies today, and start putting your circuit breakers at code level can be a good way to start using it.

Conclusion

The circuit breaker is a good pattern that can bring to your applications an improvement in its stability. It’s worth to start using it and launch a stability culture at your companies if it hasn’t already.

Final thought

If you have any questions that I can help you with, please ask! Send an email (otaviopvaladares@gmail.com), pm me on my Twitter or comment on this post!

Follow my blog to get notified off every new post: