Consistent modelling of Serverless long-running background tasks

Written by asher.sterkin | Published 2018/05/07
Tech Story Tags: aws | serverless | serverless-architecture | aws-fargate | modelling-of-serverless


Warming Up

In the previous post I proposed a new approach to specifying Serverless Architecture. There, I argued that current modeling practices use too limited a vocabulary and produce artifacts without consistent semantics:

Serverless Architecture Language: Could we make it better? (medium.com)

As a running sample I chose the MakirOto application, presented by the IL AWS Solution Architects Team at the recent AWS Tel Aviv Summit.

Applying the new modeling approach to the MakirOto Frontend component produced the following high-level Process Model:

MakirOto Process Model so far

In this diagram:

  • dashed rectangles represent services, implemented as AWS CloudFormation stacks
  • arrows denote visibility and access permissions between computations and resources from various stacks

The first post was about modeling online services. In this post, I will take a closer look at consistent modeling of long-running background processes.

For that purpose I will use another MakirOto component — Data Collector. This component is responsible for crawling social networks in order to obtain important information about user connections and interests.

MakirOto Data Collector: Starting Point

This is the DataCollector architecture presented during the session:

MakirOto Data Collector

In this diagram, some icons represent computations and resources, while others still represent AWS services. Connection lines do not carry any clear semantics.

Let’s see if we could improve it without too much effort:

MakirOto Data Collector Process Model

This is, indeed, an improvement. Here, every icon represents either

  • a computation process instance (AWS Lambda, Step Function, Fargate Service)
  • or a fully managed resource (S3 Bucket, SQS Queue)

Every arrow denotes visibility and access rights (not control or data flow!).

Although this diagram is now semantically consistent, it is still hard to reason about. It has too many boxes, too many connections, and too many unrelated concepts combined on one page. It is still hard to say what exactly this component does and how it relates to the rest of the system.

To make progress we need to break this diagram into smaller chunks. Let’s start with the Crawler.

MakirOto Crawler

MakirOto Crawler Process Model

MakirOto Crawler Typical Event Sequence

With these two diagrams we have a better understanding of what’s going on, and can start evaluating alternatives.

The whole purpose of introducing a consistent modeling language for serverless architecture is to enable systematic evaluation of multiple, clearly articulated alternatives.

Ideally, every model element should be scrutinized, justified, and compared with possible alternatives. While this is not always possible due to time constraints, the architecture language must support such a process to the fullest extent.

Let’s start with the Worker Fargate service.

AWS Fargate — First Class Serverless Citizen

Although AWS has not officially acknowledged it yet, there is a growing consensus among AWS serverless practitioners that the AWS Fargate service is a first-class citizen in the Serverless Land.

AWS Lambda and AWS Fargate constitute two alternative resolutions of the same "cost vs. control" trade-off. Like AWS Lambda, the AWS Fargate model does not break the main Serverless constraint of not managing servers directly.

Differences between the two options are summarized in the table below:

AWS Lambda vs AWS Fargate
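To make the "cost vs. control" trade-off concrete, here is a rough back-of-the-envelope cost sketch. The prices are illustrative 2018-era on-demand figures, not authoritative ones; always check current AWS pricing:

```python
def lambda_cost(invocations, avg_duration_s, memory_gb,
                gb_second_price=0.00001667, request_price=0.0000002):
    """Rough AWS Lambda cost: GB-seconds of compute plus a per-request fee."""
    return (invocations * avg_duration_s * memory_gb * gb_second_price
            + invocations * request_price)

def fargate_cost(hours, vcpu, memory_gb,
                 vcpu_hour_price=0.0506, gb_hour_price=0.0127):
    """Rough AWS Fargate cost: vCPU-hours plus GB-hours."""
    return hours * (vcpu * vcpu_hour_price + memory_gb * gb_hour_price)

# A worker busy around the clock: 17,280 five-second jobs at 1 GB per day
# is 24 hours of compute; compare with one 1 vCPU / 2 GB Fargate task.
print(f"Lambda:  ${lambda_cost(17_280, 5, 1.0):.2f}/day")
print(f"Fargate: ${fargate_cost(24, 1, 2.0):.2f}/day")
```

At near-full utilization the two come out in the same ballpark, which is why the decisive factors for the Worker end up being execution-time limits and interruption tolerance rather than raw cost.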

You may find some interesting analysis of the AWS Lambda performance here:

Comparing AWS Lambda performance when using Node.js, Java, C# or Python (read.acloud.guru)

and here:

Comparing AWS Lambda performance of Node.js, Python, Java, C# and Go (read.acloud.guru)

You may find an interesting experience report about migrating to AWS Fargate here:

Migrating to AWS ECS Fargate in Production (medium.com)

The main reason for choosing Fargate over Lambda for the Worker is that a massive download from social networks, especially of photos, may take more than 5 minutes (the AWS Lambda execution limit at the time of writing). Also, due to external API constraints, this process sometimes must not be interrupted.

Notice, however, that once the user base grows above a certain size, the crawling process will run all the time anyway, refreshing social network data for existing users alongside the initial download for new ones. In the steady state, there will be almost no idle time to pay for.

The Crawler Tasks queue is also justified: crawling requests may come in bursts, for example when many new users register at once. The queue is a good way to smooth these bursts out.
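A minimal sketch of submitting a crawl request to such a queue; the message schema and queue URL are my own illustration, not MakirOto's actual ones:

```python
import json

def build_crawl_task(user_id, network, kind="initial"):
    """Build a crawl-task message body (illustrative schema)."""
    return json.dumps({"userId": user_id, "network": network, "kind": kind})

def submit_crawl_task(sqs_client, queue_url, user_id, network):
    """Enqueue a crawl request; the SQS queue absorbs registration
    bursts while the Crawler drains it at its own pace."""
    return sqs_client.send_message(
        QueueUrl=queue_url,
        MessageBody=build_crawl_task(user_id, network),
    )
```

In production the `sqs_client` would be a `boto3.client("sqs")`; passing it in keeps the logic testable without AWS credentials.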

The second Fargate service, the Poller, however, raises some questions. Its only purpose is to periodically check whether the AWS Step Function has a pending activity task and, if so, to send a corresponding crawling task specification to the Crawler Tasks queue. This will not happen all the time, so we are going to pay for idle time.

What would be an alternative? Quite simple: wrap sending a crawling task request to the queue in a Lambda Function:

Invoking a Lambda Function to Submit a Crawler Task
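A minimal sketch of such a function, assuming a scheduled trigger; the activity ARN and queue URL are placeholders, and the clients are injectable so the logic runs without AWS:

```python
import json

def handler(event, context, sfn=None, sqs=None):
    """Scheduled Lambda replacing the Poller: fetch one pending Step
    Functions activity task and forward it to the Crawler Tasks queue."""
    if sfn is None or sqs is None:
        import boto3  # only needed when actually running inside AWS
        sfn = sfn or boto3.client("stepfunctions")
        sqs = sqs or boto3.client("sqs")
    task = sfn.get_activity_task(
        activityArn="arn:aws:states:REGION:ACCOUNT:activity:CrawlerTask",
        workerName="submit-crawler-task",
    )
    token = task.get("taskToken")
    if not token:  # long poll timed out: no pending work, nothing to pay for
        return {"submitted": False}
    sqs.send_message(
        QueueUrl="https://sqs.REGION.amazonaws.com/ACCOUNT/crawler-tasks",
        MessageBody=json.dumps({"taskToken": token,
                                "spec": json.loads(task["input"])}),
    )
    return {"submitted": True}
```

Forwarding the task token along with the task specification is what later lets the completion side report back to the Step Function.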

Notice some subtle yet important changes in naming. Personally, I always prefer concrete, domain-specific names over generic but less informative ones such as Poller, Worker, Manager, Dispatcher, Init, Update, and so on.

At the end of each crawling sequence, the Crawler Fargate service directly updates the Data Collector's pending activity status. Is that really justified? Why does the Crawler micro-service need to know that it is orchestrated by a Step Function? This seems to be unnecessary coupling. What would be an alternative? One possibility is to use an AWS SNS Topic to signal that the crawling task is over:

Introducing Crawler Task Status Notification

As a nice side effect, we will now be able to keep track of crawling progress for other purposes, such as monitoring.
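The decoupling can be sketched as two small pieces: the Crawler publishes a status message to a (hypothetical) task-status topic, and an SNS-subscribed Lambda relays it to Step Functions. Names and the payload schema are my own illustration:

```python
import json

def completion_message(task_token, status, detail=None):
    """Crawler-side payload for the task-status SNS topic; the Crawler
    itself never talks to Step Functions."""
    return json.dumps({"taskToken": task_token, "status": status,
                       "detail": detail or {}})

def on_status(event, context, sfn=None):
    """SNS-subscribed Lambda relaying crawl completion back to the
    Step Function via the task token carried in the message."""
    if sfn is None:
        import boto3  # only needed when actually running inside AWS
        sfn = boto3.client("stepfunctions")
    for record in event["Records"]:
        msg = json.loads(record["Sns"]["Message"])
        if msg["status"] == "SUCCEEDED":
            sfn.send_task_success(taskToken=msg["taskToken"],
                                  output=json.dumps(msg["detail"]))
        else:
            sfn.send_task_failure(taskToken=msg["taskToken"],
                                  error="CrawlFailed")
    return len(event["Records"])
```

Only the relay Lambda knows about the orchestration; swapping Step Functions for something else would not touch the Crawler.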

AWS SNS is not the only possible notification mechanism. You will find a more detailed analysis here:

How to choose the best event source for pub/sub messaging with AWS Lambda (medium.freecodecamp.org)

There is another subtle problem with the current design. In order to understand what it is, we need to look at the Crawler service implementation model:

Crawler Service Implementation Model

From the Implementation Model above, it appears that we will need to rebuild and re-deploy the whole Crawler Service Docker image every time we decide to:

  • support a new social network
  • extract more data from an existing one
  • upgrade to a new version of some API

This is a violation of the Open-Closed Principle, which states that a system should support adding new functionality without modifying existing functionality. Introducing all these changes would jeopardize the operational stability of the whole service.

An alternative would be to encapsulate the crawling process for each social network in a separate service:

Individual Crawler Service per Social Network

This diagram looks a bit complicated. Let’s compress it to a high-level overview:

MakirOto Crawler High-Level Structure

Now, individual social network crawlers could be documented separately. For example:

MakirOto Facebook Crawler
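The open-closed benefit can be sketched as a small routing table: a dispatcher in front of the per-network queues looks up the destination, and supporting a new network is purely additive. Names and URLs below are my own illustration:

```python
# Per-network queue registry: supporting a new social network means adding
# an entry here and deploying its crawler service; existing crawlers stay
# untouched. URLs are placeholders.
QUEUE_URLS = {
    "facebook": "https://sqs.REGION.amazonaws.com/ACCOUNT/facebook-crawler-tasks",
    "linkedin": "https://sqs.REGION.amazonaws.com/ACCOUNT/linkedin-crawler-tasks",
}

def route_task(task):
    """Pick the destination queue for a crawl-task dict; unknown
    networks fail fast instead of landing in the wrong crawler."""
    network = task["network"]
    try:
        return QUEUE_URLS[network]
    except KeyError:
        raise ValueError(f"no crawler registered for {network!r}")
```

Each per-network crawler can then evolve, and be rebuilt and redeployed, on its own schedule.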

We have made good progress in specifying an architecture for serverless long-running background tasks implemented on top of the AWS Fargate service.

In order to complete the picture, we also need to look at long-running workflow processes, sometimes called Sagas, implemented using AWS Step Functions. In the case of the MakirOto application, we will need to take a closer look at the internal details of the Data Collector service. I will cover this topic in the next post.

Concluding Remarks

The architecture process is primarily about evaluating multiple alternatives and communicating decisions in a clear and unequivocal way. Without multiple alternatives on the table, we are at risk of slipping from engineering into ideology, which is bad for business.

To be able to evaluate multiple alternatives and to communicate the final decision, we need to specify them precisely. For that purpose we need a suitable language.

What we currently use is based on a very limited vocabulary and inconsistent semantics. For that reason, I started looking for an alternative based on the seminal work of P. Kruchten:

Developing a new language by writing a grammar book is a hopeless task. Languages are living creatures. To develop a language one has to speak it, to write prose and poems. To make bad jokes, if necessary.

In the case of serverless architecture that means analyzing case studies and retelling their stories using this new language. The more, the better.

This is what I started doing with the MakirOto application. The IL AWS Solutions Architect team did a very decent job of picking a really good sample application. This application will continue to supply great material for another couple of posts. After that, I might start looking elsewhere, including an analysis of Medium posts tagged with "serverless".

The current version of the Serverless Architecture Language (shall I call it SAL?) is far from perfect. It will definitely need to undergo multiple revisions. Personally, I believe it can succeed only through a joint community effort. Premature fixation or, Heaven forbid, commercialization would be a fatal blow.

If you have a case study which you would like to retell in this emerging language, drop me a line. Otherwise, stay tuned for the next post. In any case, I would love to hear what you think.

