Common Design Patterns for Building Resilient Systems (Retries & Circuit Breakers)

Written by rohantiwari | Published 2021/09/07
Tech Story Tags: software-reliability | design-patterns | resiliency | distributed-systems | scalability | site-reliability-engineering | microservices | best-practices-microservices

TL;DR: It is normal for large-scale distributed apps to experience failures for many reasons. From an end-user experience point of view, it is important that the app degrades gracefully. This article discusses two well-known design patterns, circuit breakers and retries, and some practical techniques for applying them.

Introduction

We are aware of the age-old adage that failure is the best form of learning. This also applies to the design of highly resilient, large-scale distributed apps. The most resilient apps are built after years of learning from system failures that have led to several design changes, fine-tuning of the system architecture, and optimization of processes. It is crucial for developers to remain proactive and not wait for failures to occur; they should always be on the lookout for systemic problems that could one day cause an end-user-impacting outage. During the initial design of the system itself, it is recommended that certain best practices and design patterns be considered based on expected traffic estimates and usage patterns. This article touches upon two common design patterns for building resilient systems and some tips to implement them in practice.

Types of Failures

Distributed apps generally fail due to hardware errors (e.g., network connectivity issues, server failures, power outages) or application errors (e.g., functional bugs, poor exception handling, memory leaks). For the purposes of this article, we aggregate these errors into two categories that give us a more useful view:

  1. Transient errors: These errors are expected to go away in a short time e.g., network issues.

  2. Non-Transient errors: These errors are not expected to go away in a short time e.g., complete hardware failure.

Next, we talk about approaches to handle these errors.

Retries

Retries are best suited for transient errors: they do little to reduce the traffic being sent to the service experiencing issues, so they only help when the underlying problem is expected to clear quickly. Typically, the following retry strategies are used:

  1. Simple Retry: Retry until the failing service is healthy again
  2. Retry with constant delay: Retry until the failing service is healthy again but wait for a fixed amount of time between retries. This manages the load being sent to the failing service.
  3. Retry with increasing delay: Retry until the failing service is healthy again, but wait for an increasing amount of time between retries. Some developers implement this as an exponential backoff strategy, where each failed invocation leads to a longer wait before the next retry. This also manages the load being sent to the failing service, though the caller may have to wait a bit longer to get a successful response.
  4. Retry with fixed limit: Retry a fixed number of times. This is like #1 but gives up after retrying for a while. It is important to be judicious while retrying: the failing service must return a sensible error code, which should be used to determine whether the error is retriable. Idempotency is another aspect to keep in mind, as some applications cannot tolerate an action being performed twice, e.g., charging somebody’s credit card. A sketch combining strategies #3 and #4 follows this list.
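
The sketch below is one way to combine exponential backoff with a fixed retry limit. It is a minimal illustration, not a production implementation; call_remote_service and the RETRIABLE_ERRORS tuple are hypothetical placeholders for whatever client library and error types your service actually uses.

    import random
    import time

    # Hypothetical set of error types worth retrying; adjust to your client library.
    RETRIABLE_ERRORS = (TimeoutError, ConnectionError)

    def call_with_retries(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
        """Invoke operation(), retrying transient failures with exponential backoff."""
        for attempt in range(1, max_attempts + 1):
            try:
                return operation()
            except RETRIABLE_ERRORS:
                if attempt == max_attempts:
                    raise  # fixed limit reached; give up and surface the error
                # Exponential backoff with jitter so callers do not retry in lockstep.
                delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
                time.sleep(delay * random.uniform(0.5, 1.0))

    # Usage (call_remote_service is a hypothetical function):
    # result = call_with_retries(lambda: call_remote_service(order_id))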

Circuit Breakers

Circuit breakers are best suited for non-transient errors. They offer a smarter way for an application to wait for a service to become healthy without wasting CPU resources. The biggest benefit of a circuit breaker is that it rate-limits traffic to a failing service, which increases the chances of it becoming healthy sooner. Circuit breakers maintain a count of recent failures for a given operation within a time window and allow new requests to go through only if this number is below a threshold. They are generally modeled as a state machine with the following states (a minimal sketch follows the list):

  1. Open: Do not allow any requests to go through and return a failure.

  2. Half-Open: Allow some requests to go through but if the number of failing requests crosses the threshold, the circuit moves to the open state.

  3. Closed: This is the healthy state, and requests are allowed to go through. The circuit breaker monitors the count of failed requests over a time window to decide whether to trip the circuit to the open state.
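
Here is a minimal state-machine sketch of the pattern. It assumes the circuit moves from open to half-open after a fixed recovery timeout and, to keep the sketch short, uses a simple consecutive-failure count rather than a sliding time window; the thresholds are illustrative and should be tuned to the recovery pattern of the service being called.

    import time

    class CircuitBreaker:
        CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

        def __init__(self, failure_threshold=5, recovery_timeout=30.0):
            self.failure_threshold = failure_threshold  # failures before tripping open
            self.recovery_timeout = recovery_timeout    # seconds to wait before probing
            self.failure_count = 0
            self.state = self.CLOSED
            self.opened_at = 0.0

        def call(self, operation):
            if self.state == self.OPEN:
                if time.monotonic() - self.opened_at < self.recovery_timeout:
                    raise RuntimeError("circuit open; failing fast")
                self.state = self.HALF_OPEN  # let a probe request through
            try:
                result = operation()
            except Exception:
                self._on_failure()
                raise
            self._on_success()
            return result

        def _on_failure(self):
            self.failure_count += 1
            if self.state == self.HALF_OPEN or self.failure_count >= self.failure_threshold:
                self.state = self.OPEN
                self.opened_at = time.monotonic()

        def _on_success(self):
            self.failure_count = 0
            self.state = self.CLOSED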

Here are some ways to configure circuit breakers:

  1. Per Service: This approach has one circuit set up per service. A service usually runs on multiple machines, so a single unhealthy host will not affect the circuit. Generally, the best way to configure this kind of circuit breaker is for the circuit to open only when most of the hosts are unhealthy. The problem with this approach is that if a few hosts of a service are unhealthy, the circuit could remain open or half-open, causing some of the requests to fail, which could be unacceptable depending on the application. In this case, it is a good practice to have the load balancer monitor the health of the service on each host and remove unhealthy hosts from rotation.
  2. Per Host: This approach has one circuit set up per host. This is trickier to implement because the client now must know the count and identity of each host of the service, which essentially adds the burden of client-side load balancing. When the number of failed requests for a host crosses a threshold, the client-side load balancer simply stops sending further requests to that host, restoring the overall success rate. A sketch of this approach follows the list.
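
The sketch below shows one way to implement the per-host option by keeping a separate instance of the CircuitBreaker class from the previous sketch for each host. The host list is a hypothetical example; in practice it would come from service discovery or the deployment configuration.

    import random

    # One breaker per host of the downstream service (hypothetical host list).
    breakers = {host: CircuitBreaker() for host in ["10.0.0.1", "10.0.0.2", "10.0.0.3"]}

    def call_any_healthy_host(operation_for_host):
        # Simple client-side load balancing: prefer hosts whose circuit is not open.
        candidates = [h for h, b in breakers.items() if b.state != CircuitBreaker.OPEN]
        host = random.choice(candidates or list(breakers))
        return breakers[host].call(lambda: operation_for_host(host))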

The best way to select a configuration is to understand the reasons for service failure and pick the simplest option. Understanding the recovery patterns of the failing service plays an important role in configuring thresholds, e.g., avoid staying in the open state for too long after the service has become healthy.

Some Practical Tips

The major problem with circuit breakers and retries is that if the failing service does not become healthy quickly, there is no way out, leading to poor user experience and potential loss of business. It is important to investigate mechanisms that can offer some functionality to the user in such cases. A fallback is one such mechanism. There are various types of fallback mechanisms (a sketch combining two of them follows the list):

  1. “Kiddo” service: This funny-sounding option offers a set of very basic functionalities and can be switched on in case of a major outage to the live service. E.g. if a taxi booking service fails completely, a “Kiddo” service could kick in that allows users to book only one type of taxi. Ideally, the “Kiddo” is hosted in a separate region or a separate data center.

  2. Constants: We all hate constants in code but having a set of local defaults for values that are retrieved from a remote service can save the day when the remote service has been down for several hours.

  3. Cached responses: This option caches some previous successful responses from the remote service and uses them when needed.
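
Here is a small sketch of how the cached-response and constants fallbacks can be layered around a circuit breaker call. The get_pricing_with_fallback function and the DEFAULT_PRICING constant are hypothetical; the point is the order in which the fallbacks are tried.

    # Local constant defaults used only when everything else fails (hypothetical values).
    DEFAULT_PRICING = {"currency": "USD", "base_fare": 5.0}
    _last_good_response = None  # most recent successful response from the remote service

    def get_pricing_with_fallback(breaker, fetch_pricing):
        """Try the remote service via the circuit breaker, then the cache, then constants."""
        global _last_good_response
        try:
            _last_good_response = breaker.call(fetch_pricing)
            return _last_good_response
        except Exception:
            # Prefer the last cached success; fall back to local constants otherwise.
            return _last_good_response or DEFAULT_PRICING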

Chaos testing is a good way to test retry and circuit breaker configurations. These configurations evolve over time due to learnings from failures. Chaos testing makes it possible to mimic those failures and design reliable retry and circuit breaker configurations.

Conclusion

We have talked about both circuit breakers and retries in detail, but there is still a lot to learn about them. I recommend choosing one or the other, or a combination of both, depending on the failure patterns, application SLOs, traffic numbers, and usage patterns.

References

  1. Retry Pattern
  2. Circuit Breaker Pattern


Written by rohantiwari | Rohan has 10+ years of experience building large scale distributed systems, developer frameworks and bigdata platforms.