How To Choreograph Event-Driven Microservices

Are you trying to claw your way out of the web of API calls that ties your microservices together? Does a seemingly innocent change or bug fix result in a ripple effect across several business serving capabilities? Well, you're not alone.

Microservices have been gaining steam since their introduction as an architectural style in 2011. Initially pioneered by companies like Amazon and Netflix as an alternative to their exploding monolithic codebases, they are now increasingly popular even at companies operating at a much smaller scale than the behemoths. And with good reason. When designed well, microservices are a great alternative to the problems often seen with their monolithic counterparts. The key phrase there, though, is "when designed well". It's simple enough when you have ten microservices - so scalable, such fun! But when those quickly grow to 50, and then 100, and then 500, you have a real problem on your hands if you haven't paid attention to how they all talk to each other.

Imagine you have several microservices all communicating via API calls. In this web of tightly coupled services, changes to one service may necessitate corresponding changes across multiple other services, and scaling one service would necessitate scaling a number of others as well. This problem was first described as the Death Star Architecture.

In the world of microservices, Death Star architecture is an anti-pattern where poorly designed microservices become highly interdependent, forming a complex network of interservice communication. When this happens, the entire thing becomes slow, inflexible, and fragile – and easy to blow up. (ref.)

At this point, you've lost many of the benefits of having microservices in the first place and, in reality, are left with a distributed monolith.

So how do you avoid the Death Star Architecture trap and allow your microservices to scale? How do you keep your microservice relatively isolated but still remain an integral node in the set of business flows that it serves? Enter the Event-Driven Microservice Architecture. The golden rule of Event-Driven microservices is that all communication is asynchronous. No API calls for us! Microservices instead publish records of their doings, also known as events. An Event is a record of a business action and must contain all information relevant to that action. Events are published to messaging infrastructure (think Kafka, RabbitMQ) and it is left to consuming microservices to figure out how to operate on them. By removing this tight coupling between services, it's possible to truly reap the benefits offered by the microservices architecture pattern.
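To make the idea of an event concrete, here is a minimal sketch in Python. The class name, the field names, and the choice of JSON as the wire format are all assumptions made for illustration; your own events would carry whatever data the business action needs.

```python
import json
import uuid
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class OrderCreated:
    """Hypothetical event: a record of one business action and the data relevant to it."""
    order_id: str
    customer_email: str
    items: dict          # sku -> quantity
    total_cents: int
    event_type: str = "OrderCreated"
    # A unique id and a timestamp let consumers deduplicate and order events.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def serialize(event) -> bytes:
    """Encode an event as JSON bytes, ready to hand to a messaging client."""
    return json.dumps(asdict(event)).encode("utf-8")


event = OrderCreated(
    order_id="order-1",
    customer_email="a@example.com",
    items={"sku-1": 2},
    total_cents=2500,
)
payload = serialize(event)
```

Note that the event carries everything a consumer might need (items, total, email), so no consumer has to call back to the publisher to do its job.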

Event-Driven Messaging comes in two flavors - choreography and orchestration.

What is the choreography pattern?

Choreography is pretty much what it sounds like! Each dancer in a ballet troupe knows their position and performs their routine based on musical cues. Choreographed microservices behave in the same way - each service (dancer) is aware of its place in the business flow and acts on certain cues (events).

Let's look at a simplified example of an order processing flow. The customer checks out their cart, and the following steps need to happen next:

An order needs to be created
An email with the details of the order needs to be sent to the customer
Inventory needs to be decreased
A hold needs to be placed on the customer's credit card

The business flow may look something like this

These steps may all be implemented by different microservices - the Orders Service, Communication (Comms) Service, Inventory Service, and Payment Service. If all communication happened via HTTP calls, the Orders Service would call the Inventory Service and the Comms Service, and the Inventory Service would call the Payment Service. Each of these services now needs to know not just the role it plays, but also:

Which services come before and after them in the flow, and
What role these neighbors play.

As the number of your business flows increases, this web of API calls can become untenable to maintain and lead you into the dreaded Death Star trap.

How do you model this flow with a Choreographed Event-Driven Architecture?

It's quite a simple concept - each service publishes the actions it has taken to a message stream (think SQS or Kafka). Interested services can consume from these streams and take the appropriate actions. It would look something like this:

In the example above,

The Orders Service creates an order with the items in the customer's cart
It then publishes an "Order Created" event to a stream on an appropriate messaging platform.
The Inventory Service and Comms Service consume from this message stream (note, since you have two different consumers of this stream, you would need to use a messaging platform that supports multiple "consumer groups" so that each service can get the full set of events on the stream).
The Inventory Service and Comms Service trigger certain business actions based on the data in the "Order Created" event.
The Inventory Service then publishes its own "Inventory Reserved" event to a separate message stream which is read by the Payment Service.
On receipt of an Inventory Reserved event, the Payment Service knows to go ahead and place a hold on the customer card.
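The steps above can be sketched with a toy in-memory broker. This is only an illustration: the `InMemoryBroker` class, the stream names, and the handler functions are all made up for this example, and a real messaging platform would deliver events asynchronously and durably rather than via direct function calls.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy stand-in for a messaging platform.

    Each subscriber behaves like its own consumer group: every subscriber
    sees every event published to the stream.
    """

    def __init__(self):
        self._subscribers = defaultdict(list)  # stream name -> handlers

    def subscribe(self, stream, handler):
        self._subscribers[stream].append(handler)

    def publish(self, stream, event):
        for handler in self._subscribers[stream]:
            handler(event)


broker = InMemoryBroker()
stock = {"sku-1": 5}
emails_sent = []
card_holds = []


def create_order(order_id, email, items, total_cents):
    # Orders Service: create the order, then announce it.
    # It has no idea who is listening.
    broker.publish("order-created", {
        "order_id": order_id, "email": email,
        "items": items, "total_cents": total_cents,
    })


def reserve_inventory(event):
    # Inventory Service: react to "Order Created", then publish its own event.
    for sku, qty in event["items"].items():
        stock[sku] -= qty
    broker.publish("inventory-reserved", {
        "order_id": event["order_id"], "total_cents": event["total_cents"],
    })


def send_confirmation(event):
    # Comms Service: react to the same "Order Created" event independently.
    emails_sent.append((event["email"], event["order_id"]))


def place_hold(event):
    # Payment Service: react to "Inventory Reserved".
    card_holds.append((event["order_id"], event["total_cents"]))


broker.subscribe("order-created", reserve_inventory)
broker.subscribe("order-created", send_confirmation)
broker.subscribe("inventory-reserved", place_hold)

create_order("order-1", "a@example.com", {"sku-1": 2}, 2500)
```

Notice that `create_order` never mentions inventory, email, or payments. The flow exists only in the subscriptions.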

We can see that each service selfishly focuses on just its own role without needing to worry about upstream and downstream dependencies. The loose coupling allows for independent development, deployment, and scaling of each service. The caveat is that the agreed-on event "contract" must always be maintained - there cannot be any breaking changes to the "Order Created" and "Inventory Reserved" events without explicit agreement from the services consuming these events. This architecture pattern is a form of "Emergent Behavior" - the business flows are described not just by the microservices but by the relationships between them as well.
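One lightweight way to guard an event contract is for a consumer (or a shared test suite) to check the fields it depends on. The sketch below assumes a hypothetical "Order Created" shape with four required fields; the field names and the `contract_violations` helper are illustrative, not a standard.

```python
# Hypothetical contract: the fields consumers of "Order Created" rely on.
ORDER_CREATED_REQUIRED = {
    "order_id": str,
    "email": str,
    "items": dict,
    "total_cents": int,
}


def contract_violations(event, required=ORDER_CREATED_REQUIRED):
    """Return a list of problems; an empty list means the contract is honored."""
    problems = []
    for name, expected_type in required.items():
        if name not in event:
            problems.append(f"missing field: {name}")
        elif not isinstance(event[name], expected_type):
            problems.append(
                f"wrong type for {name}: expected {expected_type.__name__}"
            )
    return problems


# Adding a new field is a non-breaking change: existing consumers ignore it.
additive = {"order_id": "order-1", "email": "a@example.com",
            "items": {"sku-1": 2}, "total_cents": 2500, "coupon": "SPRING"}

# Removing or renaming a field is a breaking change.
breaking = {"order_id": "order-1", "items": {"sku-1": 2}, "total_cents": 2500}

additive_problems = contract_violations(additive)
breaking_problems = contract_violations(breaking)
```

Here `additive_problems` comes back empty while `breaking_problems` reports the missing `email` field, which is exactly the kind of change that needs agreement from consumers first.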

What are the benefits of choreography?

This pattern offers all the advantages of a loosely coupled services architecture pattern. Microservices can be scaled up and scaled down independently and isolated development of features is possible. With direct call patterns, making a change on a single service often necessitates corresponding changes on all neighboring services because of the tight coupling.

Most important, unlike the direct call microservices pattern, API errors do not cause cascading failures - if one microservice fails, it does not cause failures in others and the business flows that are not dependent on it can still continue.

But it's not all roses either...

Choreography makes it easy to introduce functionality at the beginning and end of a flow, but introducing steps in the middle is trickier. Imagine you want to modify our business flow above to add a new step to update accounts.

The above modification of the business flow translates to the corresponding changes in our service flow

We've added the Accounts Service and made it consume from the "Order Created" event stream. We've also created a new "Accounts Updated" stream that the Accounts Service produces. These changes are relatively straightforward. However, editing the Inventory Service is trickier. The service needs to stop consuming from the "Order Created" stream and start reading from the "Accounts Updated" stream instead. In-flight "Order Created" events need to be handled to avoid message loss. Timing these changes across services so that they do not cause unwanted side effects needs careful consideration and coordination. And this effort grows quickly as your business flows get more complex.
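There are several ways to manage that switch. One possible approach, sketched below, is to agree on a cutover point: the Inventory Service keeps draining "Order Created" events up to that point and picks up everything later from "Accounts Updated". The `CUTOVER_OFFSET` value, the handler names, and the idea that events carry a numeric offset are assumptions made for this example.

```python
# Hypothetical: the last "Order Created" offset the old wiring is responsible for.
# The Accounts Service is assumed to start consuming right after this offset.
CUTOVER_OFFSET = 2

reserved_orders = set()


def reserve(order_id):
    """Idempotent reservation: handling the same order twice has no extra effect."""
    if order_id in reserved_orders:
        return False
    reserved_orders.add(order_id)
    return True


def on_order_created(offset, event):
    # Old path: only drain events published at or before the cutover.
    if offset > CUTOVER_OFFSET:
        return False
    return reserve(event["order_id"])


def on_accounts_updated(event):
    # New path: orders after the cutover arrive via the Accounts Service.
    return reserve(event["order_id"])


# In-flight "Order Created" events at the time of the switch.
order_created_stream = [
    (1, {"order_id": "o-1"}),
    (2, {"order_id": "o-2"}),
    (3, {"order_id": "o-3"}),  # past the cutover, ignored on the old path
]
for offset, event in order_created_stream:
    on_order_created(offset, event)

on_accounts_updated({"order_id": "o-3"})
on_accounts_updated({"order_id": "o-3"})  # a redelivery is harmless
```

The idempotency check is what makes a redelivered event harmless; without it, an overlap during the transition could reserve inventory twice.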

The other downside of a choreographed workflow pattern is that observability and troubleshooting are not straightforward. You need to have a good understanding of the business flows and which services they traverse. Tracing the path of an entity through this architecture involves diving into multiple logs and dashboards. Troubleshooting is equally cumbersome, since it involves querying the materialized state of the events to understand where an error could have occurred.
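As a rough illustration of what that querying looks like, the sketch below reconstructs the path of a single order from a combined event log and reports the first step that never happened. The log format, the `EXPECTED_FLOW` list, and the "PaymentHoldPlaced" event name are assumptions for this example; in practice the events would be spread across several streams or stores.

```python
# Hypothetical expected sequence of events for one order.
EXPECTED_FLOW = ["OrderCreated", "InventoryReserved", "PaymentHoldPlaced"]


def trace(order_id, event_log):
    """Return the event types recorded for one order, oldest first."""
    matching = [e for e in event_log if e["order_id"] == order_id]
    matching.sort(key=lambda e: e["occurred_at"])
    return [e["event_type"] for e in matching]


def first_missing_step(order_id, event_log, expected=EXPECTED_FLOW):
    """Return the first expected event that is absent, or None if the flow completed."""
    seen = set(trace(order_id, event_log))
    for step in expected:
        if step not in seen:
            return step
    return None


event_log = [
    {"order_id": "o-1", "event_type": "InventoryReserved", "occurred_at": "2024-01-01T10:00:02Z"},
    {"order_id": "o-1", "event_type": "OrderCreated", "occurred_at": "2024-01-01T10:00:00Z"},
    {"order_id": "o-1", "event_type": "PaymentHoldPlaced", "occurred_at": "2024-01-01T10:00:05Z"},
    {"order_id": "o-2", "event_type": "OrderCreated", "occurred_at": "2024-01-01T10:01:00Z"},
]

path_o1 = trace("o-1", event_log)
stuck_at = first_missing_step("o-2", event_log)
```

For order `o-2` the helper points at "InventoryReserved", which tells you to start looking in the Inventory Service. Carrying a shared identifier such as the order ID on every event is what makes this kind of lookup possible.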

So when should you use the choreography pattern?

Choreography is a good choice when

You have well defined and documented business flows
You have a limited number of services (under 20 in my opinion)
You are not making frequent changes to long-running business flows

If you need to update long-running business flows often and have a fast-growing number of services, you should reconsider using a pure choreography pattern.

In short...

The choreography pattern offers a great alternative to traditional direct call microservices by doing away with the tight coupling between services. It does have its downsides: adding intermediary steps to a business flow is difficult, and monitoring and observability are complicated (although not more so than in the direct call pattern). The choreography pattern is suitable for business workflows that aren't too complex and don't involve too many decision points. An alternative to choreography is the orchestration pattern, which is a good option for the more gnarly business flows - more to come on that in another post!

Thank you to the good folks at excalidraw.com and the libraries by @Youri Tjang and @Kaligule for the drawing tools!