Circuit Breakers

Every building today has one, you’ve probably already seen one. It is essential for your security and to prevent your electronics don’t break.

Circuit breakers are designed to prevent your electrical circuit from damage caused by too much current flowing through it, it basically switches automatically to interrupt the current flow until somebody can fit it.

It is famous at wall street too, and can be called “trading curb”, it’s used to prevent dramatic losses and speculative gains, when the market fall or rises a lot in a small data frame, it opens and stops trading.

Circuit breakers are a common pattern in distributed systems too, it was described by Michael T. Nygard in his famous book Release It.

Today most of the computer systems can be considered distributed systems, or at least make one request to external services, the way of communicating can vary but 90% of our systems are exchanging information across the internet.

In a constant seek for more and more reliable and stable systems, engineers need to be careful about everything that can go wrong, the internet isn’t 100% trusty and when you’re making requests between two points using it, you need to keep in mind that a lot of things can go wrong, as we learned with the famous “Fallacies of distributed computing” for example, and that’s why integration system can be considered an antipattern of stability.

In this context, today, circuit breakers became a popular choice to handle HTTP errors, but it can be used to handle critical operations too.

What will occur with your system if another system or operation starts to fail? You’ll start to have a cascading failure “A cascading failure occurs when an error in one system affects others, with the initial failure walking down into your systems layer causing other errors.”.

Cascading Failure spanding across services.

The algorithm is simple and short, it has two main states “open” and “closed”, and we start it closed (the normal state) when the circuit is closed everything is working as expected, if one error occurs during the execution of the handled operation the circuit starts to track the number of errors, if the number the errors at a certain time crosses the threshold it will open.

The second main state is “open”, the operation will not be executed because a sequence of N errors in X time has made the circuit open.

After some time (predefined too) on the open state, the circuit changes to a “transition state” called half-open. When it is in this state, the circuit executes the next call to the handled operation, if the execution success the circuit goes to closed state again, if the execution fails it back to open state and wait for the next change to half-open.

Obviously in some cases only one successful request isn’t good to switch the state to closed again, and it can be configured too.

Circuit breaker states

Technical Details

Exceptions or response structures

If you search on the internet, you’ll find a lot of different ways to implement circuit breakers, some people following the suggestion of the Release It! book and other ways.

The first thing that you need to pay attention is how you’ll know if the operation successes or failed, you have two great options here, you can control it by exceptions or by the response.

When handling with exceptions you have the advantage that it’s easy to start using circuit breaker because exceptions are already present in a lot of languages, in case of your own operations you only need to raise an exception when something goes wrong, when dealing with 3rd party libs, almost all will raise exceptions. Particularly I don’t like this approach, I know that some people may like to control flow based on exception but I don’t consider it a good way, and it’s important to remember that some languages like rust doesn’t have exceptions.

You can know if your operation has failed or not based on its response, it looks much more smoothie, readable, and simple. But what’s the operation needs to return as its “response”? The operation can return any structure that contains information if the operation successes, for example, if you are working with an object-oriented language you only need to return any object that responds to a message like success?, it will be much more “object-oriented”.

class Response
  def success?
    ...
  end
end

This approach of return response structures has become popular in the last years, “hyped” languages encourage it, for example, go lang have its built-in error the type that is an interface, it’s common to an operation return it’s the result and an error (if any), for example:

f, err := os.Open("filename.ext")
if err != nil {
    log.Fatal(err)
}

Rust has it’s result type too, called Result<T, E>, if you’re dealing with functional languages, in Elixir is common to return a tuple with the first item being the return status of the function that you can pattern match against:

{:ok, %{"age" => 22, "name" => "Otavio Valadares"}}

For these reasons, I think the best way to handle your responses when dealing with circuit breakers (and any operation) maybe with response structures.

Track errors on 3rd party systems or memory

Another thing that you need to deal with is how you’ll handle the error tracking, the first obvious option is to trust in 3rd party service like a redis database, you’ll only need a key to increment with TTL. But with this approach you’ll create a shared resource, that is an antipattern of stability too, if your redis goes down your entire application will go down too? You’ll deal with the famous Quis custodiet ipsos custodes? because your connection with redis will be not faulted tolerant.

Shared Resource

The second option is to save this error tracking in memory, you have some kind of structure that stores the count and the timing, but when working with an application handling a lot of operations you can have memory problems storing this structures in memory, I know that for 95% of the cases this is not a problem today, but if you’re dealing with low memory applications it may be a problem.

In some cases, you may want to use the Redis way for some reason, in this case, I recommend using a memory circuit breaker to watch the redis connection, in other cases I think in memory track the best way to solve this problem.

Observability

Observability is another important thing when working with a circuit breaker too, if your circuit opens, it can save your application for some errors, but the true magic stands when this information can be used by your stakeholders and by other applications to change its behavior automatically, for example, if your circuit that handles your 3rd payment partner closes you can hide your payment tab in the mobile app.

You need to provide the status of your circuits (or a group of them, like, all circuit that handles some 3rd partner or critical operation when saving an invoice) somewhere, it can be simple as providing it in your healthcheck route, but when providing it in an endpoint it can lead to some problem if another application needs to check your circuit breaker status every time that will do an operation.

A good approach is to put a notification message at your favorite message broken, if an application is interested in this message it reads and takes its own decisions. Another solution that can be considered is building a “circuit breaker control pane” an application that knows the state of all circuit breakers of your company, but it will lead you to a great bus factor issue.

It’s important to show the circuit breaker switching is states and actual states for human too, and will be good to put it in your Grafana, in some cases a bot that notifies your Slack channel, and integrates it with your alarm system like OpsGenie.

Service Mesh

Code a circuit breaker logic for all applications can be frustrating and even if you’re using a library it can be boring to install it and set up in every application, thinking about that all that boring stuff about repeated logic in the application level, service mesh was created and one thing that almost all service mesh system has in its sidecars is the circuit breaker, if you’re using any service mesh, you don’t need to code it at the application level.

But I know that service mesh technologies is not a reality for a lot of companies today, and start putting your circuit breakers at code level can be a good way to start using it.

Conclusion

The circuit breaker is a good pattern that can bring to your applications an improvement in its stability. It’s worth to start using it and launch a stability culture at your companies if it hasn’t already.

Final thought

If you have any questions that I can help you with, please ask! Send an email (otaviopvaladares@gmail.com), pm me on my Twitter or comment on this post!

Follow my blog to get notified every new post:

A tale about application infrastructure

Today we’re facing the container revolution but I feel like most of the people don’t know why we are using it and what problems existed before it, and the history behind the evolution of the application deployment.

In this post, I talk about the evolution of the infrastructure of the applications, obviously, this is a topic that has size and information to be a book, but I tried to summarize it in few lines to understand the key points and the background quickly.

Physical Server

A long time ago when developing applications, companies usually run their applications on typical physical rack servers, these servers were basically like your computer running an application on top of an operating system.

Physical Server X-ray

It was very common to find small-medium companies that have a small room inside their office with a classic 19inch-rack with a server running one application, large companies usually build buildings called “datacenters” with tons of racks only to host their applications.

It was only possible to scale as we know today as “vertical scaling”, you only need to add more hardware to the next floor of your rack and that’s it.

Physical Server Vertical Scaling

This model has a lot of trouble, the first one is that one server usually runs only one application and if this application doesn’t fully use server resources it may leave unused resources.

It was very expensive too, and not all companies had a budget to buy it, it was a problem for small companies or recent-founded companies.

At the beginning of the internet it worked, but with the growth of the internet, this model was not viable anymore.

Virtual Host

A few years later, RFC 2616 introduced us to HTTP/1.1 and with him, as described in the RFC, the ability to send a “host” header in a request providing the host and port information from the target URI, enabling the origin server to distinguish among resources while servicing requests for multiple hostnames on a single IP address.

This technology is called virtual host, and its most famous application is shared web hosting, with this one server can host multi websites. The price of hosting a website decreased, many businesses started to offer website hosting for a few dollars.

The mechanism is simple, the server analyzes the host header of an incoming request, and on its configuration you configure something like this “request for this site, go to this system path”. With this, you have the same IP address resolving DNS to multiple websites.

Virtual Host Working

At this point host, a web application was easier than bare metal servers but stills to have problems, and it takes us to the next step…

Virtual Machine

Time goes on and technology from lates the 60s started to being used massively, I’m talking about operating system virtualization, this concept allows single hardware to host many operation systems or application, each one with its own operating system and environment while still sharing the same hardware resources.

Virtual Machine X-ray

Using virtualization, a company just needs to buy a server with strong hardware and boot VMs as want (and the hardware support). Another good ability is to build custom OS images with pre-installed system requirements.

This new way of build application infrastructure maximized resource utilization and simplified application architecture, the price for deploy an application decreased, and the most important, the growth of virtualization comes with the first IaaS companies, offering virtual machines allocation with “one-click”, the most famous example is the Amazon Web Services (AWS) with his famous EC2 service.

The period of more growth in IT operations around the world comes at the same time, millions of users using your application was a reality, cloud computing comes and new ways of think about infrastructure comes, microservices architecture comes in response to large applications, and now engineers don’t need to deploy a single monolith application, they need to deploy a lot of small applications.

Fallowed by the giants IT operations and tech companies, DevOps emerged and now companies make dozens of deploys/day, and new requirements on infrastructure show up. Some problems of virtualization come to mind, like that it’s not totally optimized, images are large and sometimes the boot of a new “instance” was slow.

A new technology trend called “containers” comes with the promise of change the way we think about IT infrastructure, and resolve all the problems related to virtual machines.

Containers

Follow my blog to get notified every new post:

While virtual hosts, virtual machines and all that story happened, researchers around the world were working too, and they began to advance on an implementation of an old but gold UNIX system feature called chroot, the OS-Level virtualization forerunner.

With a great time skip, in 2013 a technology called LXC was announced and later turned into the famous Docker Containers. This technology drew the attention of engineers around the world because it solves a lot of virtual machine problems.

Linux containers are a group of processes that are isolated from the rest of the system, think it is like a box isolated from the world, inside this box, you can put your application and all its dependencies and it will run isolated from the rest of the operating system, but using the same kernel of other processes.

Image illustrating putting application dependencies inside container

But now you can ask me, what the difference between Linux containers and virtual machines? The difference is simple and powerful.

Difference between containers and virtual machines

Linux containers provide process-level virtualization and don’t need to emulate the hole OS like virtual machines, they share the same kernel with the host OS (you can see on the image that Linux containers don’t boot an entire OS to work), its made containers lightweight, they can boot in milliseconds (VM usually needs minutes to boot), the containers images are smaller than VM images, containers have better performance than virtual machines and they are very secure, because each container and its process are isolated from the rest of the system.

Containers are great, not only to infrastructure but it changed the way of developing too, now you don’t have the famous problem of “works on my machine” anymore, you can code your application locally using containers and the environment that your application is running on your machine, will be the same that it will run on your infrastructure.

Engineers started to deploy their applications using containers (and develop too), probably using the famous Docker containers, but it wasn’t using 100% of the potential of containers, it seemed that something was missing and thankfully the missing “something great” didn’t take long to appear.

Container Orchestration

It was the missing thing to use 100% of Linux containers’ power, a simple and beautiful way of thinking about infrastructure and application management, now the way of think about it is totally different from the beginning when you only think about a simple server running your application.

Let’s think about its concept, you have something that we usually call “operator”, the operator it’s like a big brother and is watching everybody inside its cluster, the cluster is composed of N nodes that usually are different machines and inside each machine, we have N Linux containers (Here’s the magic!).

As the name already says, the operator is operating our cluster, and he knows something that is like a recipe that the developer writes to tell orchestrator about how the application should behave when deployed.

This recipe tells standard things (and sometimes peculiarities of the chosen orchestration technology), like:

  • Number of containers that your application will use (number of replicas)
  • Memory and CPU reservation
  • Healthchecks
  • Horizontal/Vertical scaling rules
Orchestrator and its nodes running containers]

The image is illustrating a cluster of four nodes, each one running N containers and the central orchestrator.

When orchestrator starts to do its job, your infrastructure will have some kind of life, it will behave like an organism, it will boot new containers, kill old containers, replace unhealthy containers, scale your application based on metrics, share traffic between your N replicas, your application will have resilience and performance, a universe of possibilities will exist in your infrastructure based on this basic concept that uses Linux containers with clustering, load balancing, and metrics.

Following this concept, a lot of technologies have emerged and become popular to orchestrate containers, like, ECS, Docker Swarm, and the most hyped, Kubernetes (K8s). Each one has its own peculiarities and properties (Kubernetes being the most complex of them).

Summarizing, container orchestration is the automation of all aspects of coordinating and managing your containers, it manages the lifecycle, scaling, redundancy and much more for you.

Today, millions of people are using your software and you need to think about resiliency, scalability, monitoring and much more, container orchestration solves a lot of problems related to it, but it’s hard to get it working, it is heavy coupled with clustering and load balancing and other concepts that have grown together technologies described in this text.

What’s next

We’ve talked about some important steps in the evolution of applications infrastructure until we arrive today’s powerful container orchestration, that allowed us to deal with today’s problems, but what’s next? What’re the next problems we will face when developing? What technology will solve these problems? These questions only the time will answer but the lesson we take is to always be studying and adapting to new tendencies.

Final thought

If you have any questions that I can help you with, please ask! Send an email (otaviopvaladares@gmail.com), pm me on my Twitter or comment on this post!

Follow my blog to get notified every new post:

Reference

https://devopedia.org/container-orchestration https://avinetworks.com/glossary/container-orchestration/ https://www.hpe.com/us/en/what-is/container-orchestration.html https://geekflare.com/docker-vs-virtual-machine/

What’s the Diff: VMs vs. Containers
https://www.redhat.com/en/topics/containers/whats-a-linux-container https://blog.britesnow.com/understanding-kubernetes-value-867c163d5ed2 https://www.toptal.com/kubernetes/what-is-kubernetes https://www.infoworld.com/article/3268073/what-is-kubernetes-your-next-application-platform.html https://blog.britesnow.com/understanding-kubernetes-value-867c163d5ed2 https://en.wikipedia.org/wiki/Multitier_architecture https://www.quora.com/What-is-the-difference-between-virtual-hosts-and-virtual-servers
The Definitive Guide to Bare Metal Servers for 2021
https://www.techradar.com/news/the-rise-and-rise-of-bare-metal-servers https://www.google.com.br/search?q=single-tenant+servers&oq=single-tenant+servers&aqs=chrome..69i57&sourceid=chrome&ie=UTF-8 https://en.wikipedia.org/wiki/Bare-metal_server https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol https://en.wikipedia.org/wiki/Data_center http://www.iasl.com/solutions/virtual-servers/vmfaq https://www.atlantech.net/blog/virtual-servers-vs.-physical-servers-which-is-best
Physical Servers vs. Virtual Machines: Key Differences and Similarities
https://en.wikipedia.org/wiki/Virtual_hosting#cite_note-1 https://www.idkrtm.com/history-of-virtualization/ https://searchservervirtualization.techtarget.com/feature/Whats-the-difference-between-Type-1-and-Type-2-hypervisors https://en.wikipedia.org/wiki/Shared_web_hosting_service https://www.redhat.com/en/topics/containers/whats-a-linux-container

Single responsibility principle

If you like to read about object oriented programming design, probably you’ll enjoy my post about telling don’t ask here.

Single responsibility principle is also known as SRP is a computer programming principle that consists of classes, functions or modules that can’t have more than one responsibility.

The term itself was introduced by Robert C. Martin as a part of what he calls “Principles of Object Oriented Design”.

To be more pragmatic in the text I’ll refer only to classes, but you can consider the same for modules. At the end of the text, I’ll discuss functions.

But what we define as responsibility? Responsibility can be described as each action that your classes are assigned to do, these actions usually have business logic and can be used many times, and the more important, exposed to changes. It leads us to the root of SRP, one class should have only one reason to change. If a class has more than one responsibility, then these responsibilities become coupled by with themselves, and changes to one can impact the other. This kind of problems results in poor designs that violate the basics of object oriented design.

Why it violates the basics of OOD? If you were asked to describe the purpose of object oriented programming in a few words, what would you answer? I would respond “easy to change”, but how to reach it?

Today we have a lot of techniques, patterns, and references to reach good design in your applications, but the most basic to get started is to follow SRP.

Classes with only one responsibility provide a portion of well-defined behavior that can be reused in any part of your application that needs through its public interface, your application will have a lot of small pieces of code called classes, exchanging messages between them, like cells in an organism providing life to your object-oriented application.

Classes exchanging messages

If your classes have only one responsibility and you need to change some business logic at this responsibility, you’ll only need to change it in one place, it will be easy to change.

What should belong to my class

Related Posts

Your class always needs to have only one responsibility, but how to know if it is doing only one thing?

You can analyze your class name if it contains “and” or “or”, like “EmailValidatorAndSender” probably it has more than one responsibility.

class EmailValidatorAndSender; end

Ruby empty class syntax

You can try to describe your class in a few words too, and if you use “and” or “or” your class has two responsibilities too.

Another thing that you can do is to analyze your class public interface, and see if it makes sense when compared with the class name and other methods.

class EmailSender
  def valid?(email)
     # …
  end

  def send!
    #..
  end
end

Analyzing this class public interface, the method “#valid?” makes sense? It should belong to this class if we analyze its name and other methods? Of course, no!

These two examples were very easy, the real problem stands when your multi-responsibilities is hidden inside your business logic, for example:

class ProductBuyer
  attr_reader :user, :product, :address, :payment_data

  def initialize(user, product, address, payment_data)
    @user = user
    @product = product
    @address = address
    @payment_data = payment_data
  end

  def buy!
    # ... any logic ...

    send_email if valid?

    # ... any logic ...
  end

  private

  def send_email
    EmailProvider.send(
     email: user.email,
     template: File.open("any/path/to/template.html")
    end
  end

  def valid?
    user.email =~ /\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i
  end
end

This class has a lot of logic, it is a class to buy products but it contains email validations and sends emails, this is dangerous! Why it is dangerous? When your multi-responsibilities are coupled with your class logic is harder to realize it, and the tendency is to you duplicate code when you need to validate email in other parts of your application, you probably will duplicate code:

class UserRegistration
  # ...

  private

  def validate_email
    user.email =~ /\A[\w+\-.]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+\z/i
  end
end

If someday you need to change the logic of email validation, you’ll need to change it in two distinct parts of your code.

If it has more than one responsibility, and consequently more than one reason to change, it will lead your application to poor design and consequently harder to change.

Design methods with a single responsibility

A well-designed class has only one responsibility composed of many methods with single responsibility! Don’t create the famous “sausage method” with a lot of lines, behaviors and decisions. You can extract logic to private methods, it will reduce our code complexity and boost its readability.

Side Effects of violations

Constant violation of the SRP can cause headaches and leads your application to poor design, the first effect is the developers’ productivity falling down, your application will have a poor design and will not be easy to change, developers of the system probably will need more time to make changes.

If your application classes have more than one responsibility probably you can’t import it and just use the behavior that you want, you’ll get a lot of undesirable code together. It’s dangerous when you want to use the behavior of one class that does more than one thing, but this behavior isn’t exposed through the public interface of its class because it is coupled, in this case probably your violation of SRP will end with a duplicated code.

Silver Bullet

It’s always good to remember that it isn’t a silver bullet, your application design will not be good only using SRP. It is the easiest way to get started with good design practices, but only use it will not be the answer to your problems.

Like every pattern or practice, the heavy use of it without rational thinking can lead you to a disaster, don’t start applying SRP everywhere, with time and experience you’ll start to write code with single responsibility naturally.

Final thought

If you liked this post probably you’ll enjoy my post about tell don’t ask here.

If you have any questions that I can help you with, please ask! Send an email (otaviopvaladares@gmail.com), pm me on my Twitter or comment on this post!

Follow my blog to get notified every new post:

Antipatterns of stability

In the last few weeks, I was reading the exceptional book Release It! Design and Deploy Production-Ready Software by Michael T. Nygard, this book is fantastic and I pretty recommend the reading for everyone, one of the chapters that have caught my attention was the part that he talks about patterns and antipatterns of stability.

System stability and availability is always a recurrent topic, it’s a mistake to think that errors will not occur, it always occurs and on the worst time. Based on this, you need to defend your system of common events that when occurs kill your application and user experience.

At this post, I’ll write my points of view about some systems stability antipatterns, some of these presented by Nygard at his book.

It’s important to note that I don’t show all the antipatterns presented by the author and that I made changes in some patterns to adapt to my point of view and explanation.

Antipatterns

We’ll talk about the antipatterns, they can fuck up your app if you don’t pay attention to them. Each of these antipatterns will create, accelerate or multiply or system failure.

Integration Points

All integration points is a risk to your systems, everyone can kill your operations or cause a headache to your team. And it’s important to remember that database call is an integration point too.

Let’s see an example, suppose that you have an application that needs to communicate with a 3rd party API that will give you essential information to your customer sign up, your customer can’t continue the registration flow while the external service doesn’t respond.

Integration Point Failure

You can find similar scenarios across multiple companies, on this image we have a lot of possible events that may cause an error in your operation. Most developers when first look at this image, will think “Ok, there’s a problem! If the external application goes down, I’ll have a 404 response, my system will crash and I’ll not have users register while they don’t come back.”. This assumption is true but isn’t the only problem. How your system will behave if…

  • The provider receives your request but never respond? If you don’t handle it, your user can wait forever!

  • The provider receives your request and responds after 60 seconds? It can occur if the 3rd API is facing a DDoS attack, an exceptional number of requests or database high percentage of use.

  • The response comes with a status code that you don’t know how to handle, like “423 Locked”. Probably you’ll get an error at your error tracking service.

  • The response comes with a content type that you don’t know how to handle. It will go to your error tracking service too.

  • And finally, you can get your response. But the response doesn’t come with the information that you expects. Imagine that you’re waiting for a JSON with the key “name”, but the JSON came without this key, what will happen on your application? If you don’t handle it, probably you’ll get an error trying to access this key. And again you’ll have a new error at your exception tracker.

Integration points are the number one risk of your application stability, but it’s unavoidable, today most applications are composed of microservices architecture and SaaS APIs, this is the main reason that I don’t consider it as an antipattern, for me the antipattern is each problem that came with each integration point.

Each of the problems described can give rise to a different antipattern, and you need to treat and don’t let it warm your system, the antipatterns can be named as:

  • The endless request
  • The long request
  • The unexpected response code
  • The unexpected content type.
  • The unexpected content.

A lot of patterns that I’ll describe above (at “Patterns of Stability” section), can help you against each of these problems, for example, the famous circuit breaker pattern.

Every integration point will fail, and you need to be prepared.

Cascading Failure

Check my last posts:

A cascading failure occurs when an error in one system affects others, with the initial failure walking down into your systems layer causing other errors.

Today majority of our applications are composed of microservices communicating between each one, and each service has its own dependencies, like a database, Redis and other services, each service has it’s callers too, which probably depends directly on its stability.

Microservices communicating between each other

What will happen if one of these services goes down or have a bug that makes the service unable to respond to one endpoint incoming requests?

Fail communication between services

Of course, the callers of these endpoints will have problems too, but how they handle it will define its future, or it will fail and start a cascading of failures or it will handle it correctly and don’t start a crisis in your system.

This problem is similar to problems related to integration points, but it’s more about the mash of your systems itself. Circuit Breakers and failure modes can defend your system against this kind of problem, you shouldn’t trust your system stability, always be ready for any issue, because it will always occur.

Users

Don’t store user sessions in memory, if you have a lot of users and crawlers navigating on your website, it can start an out of memory problem on your application.

Blocked Threads

Follow my blog to get notified every new post:

Blocked Threads is a point of attention, many failures like the chain of reactions and cascading failure start with a blocked thread. Like cascading failures, it usually occurs around a resource pool and integration points.

Use timeouts, especially on database connection to avoid problems with blocked thread. If you use any ORM it should already implement a friendly way to set timeout and threads pools on databases, ActiveRecord does it.

And obviously, use circuit breaker timeout on callers too.

Self-denial Attacks

Self-denial Attacks occur when someone tries to kill your own website, it’s like a suicide!

Let’s see an example: Your marketing pays U$5M for LeBron James tweet something about your company, but they don’t tell anything to technology guys… What will happen when he sends the tweet for 40M people? Of course, your website will be a burst of visitors, and how it will behave with this? And if I tell you that your marketing team asked LeBron to tweet a deep link to a promotion, your application will still available?

Good marketing can kill your system, to prevent this kind of situation communication is the key factor, create static landing pages for users first interaction is recommended, and remember, never send deep links that bypass your CDN.

Shared Resource

If you have a many-to-one or many-to-few resource relationship you probably will have a side effect when scaling. Let’s see a famous example, you have three applications using the same Redis:

Three services using the same redis

If you increase by 2,3,4…x the number of services using this Redis instance, probably you’ll start to have a bottleneck.

Seven services using the same redis

To avoid this kind of problem, on the book Michael T. Nygard propose a “shared-nothing” architecture by reducing the shared resources, reducing the number of services calling the shared resource.

Each service with a unique resource

Dogpile

Let’s suppose that you are working for a big newspaper and the homepage executes a big query to list all the latest news by relevance, this query takes 6 seconds on average. You’re a good engineer and decides to use Memcached to stores this query result, after this the query is only executed to the first user to hit the homepage, after this your application stores the result on Memcached and the next users will have an instant response from cache, it’s working.

But what will occur when this cache expires? If your application is expiring the cache every five minutes, your query will be executed again every 5 minutes. If you have only one concurrent user it’s not a problem, but let’s imagine if you have four concurrent users, 24 requests will hit your server and executes the same query that consumes high CPU usage. Depending on the size of your application and the number of concurrent users, it can be harmful.

It is the most famous kind of dogpile and the solution is to code a “smart cache system”, you can find this solution on the internet for your favorite language.

Another common type of dogpile is with cached assets on your CDN, pay attention when you invalidate it, if you have all your user base hitting the CDN to view a common image or asset and you invalidate it, all these users will hit your server directly.

Slow Responses

Slow responses are worse than a quick failure, it can be a result of excessive demand, or can also be a symptom of a problem like memory leaks or lack of a server. Slow responses can trig a cascading failure on the caller services (if they don’t implement timeout) or result in more traffic if you have users waiting for responses on your frontend.

To avoid this kind of problem, fail fast is the solution, every time that your system is slow it’s preferable to send an error. Don’t forget to use timeouts on the callers too.

Unbounded Result Sets

Most applications don’t limit a query on the database, sometimes you can be surprised by a query that would return hundreds of results returning a million and your application crashing, this is a small antipattern but pays attention to your query, always limit them and use pagination.

This pattern avoids slow responses and together with timeouts, it can help avoid cascading failures Improves overall system stability.

Conclusion

Stability and resilience are essential nowadays, the antipatterns that I wrote at this post is not all, and they don’t prevent all possible failures, this topic is a vast ocean of knowledge and you can’t stop studying.

No system is perfect, all systems have failures and you need to work to prevent it, but don’t fall into the trap of perfection, you as a developer need to be pragmatic, and sometimes some solutions are over-engineering for your objective.

Final thought

If you have any questions that I can help you with, please ask! Send an email (otaviopvaladares@gmail.com), pm me on my Twitter or comment on this post!

Follow my blog to get notified every new post: