Error Handling Strategies for Serverless Event-Driven Architecture

Written by maharshijha | Published 2022/12/16
Tech Story Tags: software-architecture | error-handling | serverless | serverless-architecture | event-driven-architecture | errors | business-strategy | tips

TLDRServerless streaming architecture is modeled after event-driven microservice architecture. This type of architecture relies on events and their corresponding handlers to process data and trigger actions. If an error occurs in one of these event handlers, it can disrupt the entire system. Error handling is an important aspect of any system, but it becomes especially crucial in event- driven architecture.via the TL;DR App

What is Event Driven Architecture?

Serverless streaming architecture is modeled after event-driven microservice architecture, where each microservice is connected to one another using a message bus.

Event-driven architecture inherently provides all the capabilities we need from a streaming solution. By overlaying managed services provided by cloud providers with an event-driven architecture, one can build a serverless streaming solution. If you can map the above pattern with managed services like AWS Lambda as a microservice and AWS Kinesis as a message bus, you can achieve the event-driven architecture using the serverless stacks.

Error Handling Strategies

In this Architecture, one of the critical architecture patterns to think of is how you handle the runtime Error handling is very critical. Error handling is an important aspect of any system, but it becomes especially crucial in event-driven architectures. This type of architecture relies on events and their corresponding handlers to process data and trigger actions. If an error occurs in one of these handlers, it can disrupt the entire system. This pattern handles millions of messages and processes them every second. This means there are chances of different kinds of runtime errors in this pattern and not handling these errors can turn into critical system failure.

In event-driven architecture, events are generated by various sources, such as user actions, system notifications, or external APIs. These events are typically delivered to a queue or broker, where they are stored until the appropriate event handler can process them.

Event handlers are responsible for responding to events and triggering the appropriate actions. For example, if a user clicks a button on a website, an event is generated and sent to the queue. The event handler for that button might trigger an API call, update a database, or send an email.

When an error occurs in one of these event handlers, it can have cascading effects on the entire system. If the error is not properly handled, it can cause subsequent events to fail, resulting in a domino effect of errors. This is why error handling is so important in event-driven architecture.

There are several strategies for handling errors in event-driven architecture. One common approach is to use a circuit breaker pattern, where the system monitors the health of event handlers and automatically stops sending events to handlers that are failing. This can prevent the domino effect of errors and allow the system to continue functioning, even if some event handlers are failing.

Another strategy is to use retries and backoff. In this approach, the system automatically retries failed events a certain number of times, with increasing intervals between retries. This can help to recover from transient errors, such as network connectivity issues, which might resolve themselves after a short delay.

Error logging is also important in event-driven architecture. Detailed logs can help to diagnose and troubleshoot errors and can provide valuable information for improving the system.

The nature of the Event Driven architecture is more stateless and microservices are simply good at doing one thing, so when you think about Error handling in the Event Driven Architecture, you need to think about whether the given message should be retried or not.

Let’s talk about some of the techniques mentioned above in more detail.

Retriable Errors

Retriable Errors can be handled based on which level that error is happening. For Example, If the Error is happening due to a System issue like a network error or some API error then you might want to keep trying to resolve it before you take the next event. The reasoning behind this is, if you drop the event or put it into the queue for later processing then there is a high chance that the same kind of error could happen for all the next events and your application will fail in the domino manor. However, if you keep retiring from this event and not taking the next event then at least you are not losing the event until it’s fixed.

In the System-level retriable error mentioned above, you must be very thoughtful otherwise you will block the entire system. For Example: if you keep retrying the event for which the error is specific to that event like an invalid request then you don’t want to retry it and drop that event.

For Above mentioned Errors, which are more the Event specific. You might want to try the Backoff Error handling. In backoff Error handling, you may send the event at the end of the Queue and can try it a configured number of times before discarding the event.

Circuit breaker pattern

Think about the Circuit Breaker pattern as the Fast Fail functionality, where let’s say your application is failing for one of the messages then you might want to break the system for that event and send it to the separate Queue also known as the Dead letter queue. Dead later Queue can be thought of as the parking lot where you can retry or investigate the failing event. This kind of event can mostly happen due to the issue in the event format itself, so it often needs manual intervention.


A circuit breaker can be in one of three states:

  • Closed: The remote service is working as expected. No short-circuiting is required.
  • Open: Remote service not responding as expected, down, or frozen. All requests are short-circuited. This state is achieved only when a specified number of failures have occurred.
  • Half Open: Records the number of successful attempts to invoke the remote calls. This helps in checking if the remote server is back online and working as expected.

Conclusion

Overall, error handling is essential for maintaining the reliability and stability of event-driven architecture. By using strategies such as circuit breakers, retries, and logging, it is possible to prevent errors from disrupting the system and ensure that it continues to function smoothly.


Written by maharshijha | Working as the Staff Engineer at Meta and an innovator in the space of Cloud Computing Serverless Technology.
Published by HackerNoon on 2022/12/16