A Deep Look Into The Service Template Compiler Solution In Python

Context and Problem Statement

As argued in a prior article, an order-of-magnitude improvement in the productivity of the software development process is required in order to realize the full potential of serverless cloud technology. Existing tool chains are conceptually locked into the 50-year-old UNIX model and are completely inadequate to support the hyper-fast yet bug-free development process required today.

The software domain needs ways to liberate application developers from DevSecOps concerns by pushing all boilerplate script generation to automatic tools. In particular, we would like to have a tool that automatically generates all cloud platform-specific deployment scripts from pure application code. While the general principles are the same for any cloud platform and/or programming language, to make things tangible within the scope of this article, we will limit ourselves to the AWS cloud and Python as the programming language. The initial problem statement, therefore, could be illustrated as follows:

Fig 1: Python Service Template Compiler Problem Statement

In this article, we provide a general overview of selected Python service models and internal implementation mechanics. The Service Template Compiler solution is built on top of the Serverless Cloud Importer described in more detail in another article. Both technologies are cornerstone ingredients of the Cloud AI Operating System (CAIOS) project initiative led by BST LABS (the advanced engineering arm of BlackSwan Technologies). For additional details about the CAIOS project, please review the original position paper.

Acknowledgements

Mordehay Kontorer and Piotr Orzeszek from BlackSwan Technologies’ BST LABS were active in developing the first version of this component. Scott Lichtman provided invaluable feedback while reviewing the initial draft of this paper.

Decision Drivers

Ideally, we should support automatic service packaging with pure Python code without having to consider the underlying cloud platform and/or communication protocols, thus allowing application developers to fully concentrate on domain logic.

We shall make an even bolder assertion. If application developers needed to deal with low-level details of the underlying cloud platform and/or communication protocol, less attention would be directed to proper domain modeling.

Development speed is the second decision driving factor. Simply put, slow development kills innovation. If too much time and too many people are required to launch the initial version of a service, the company’s management may be reluctant to approve any significant changes going forward. This is the main reason why so many services and whole systems stagnate after the initial delivery. On the other hand, if developing a service is truly affordable, management won’t be too concerned about replacing it with a new and improved version or developing multiple versions in parallel (concurrent engineering).

Within CAIOS, we set out a strategic goal of ensuring that the development of the initial, ready-for-integration version of a simple CRUD REST service should take minutes and the development of a mid- or high-complexity REST or WebSockets service, a few hours. The same goal holds for developing a non-trivial, long-running workflow. We are pursuing ten-fold productivity improvements.

Cost control is the final decision-making factor; let’s reflect on this for a moment. Developing an initial proof of concept (POC) needs to be inexpensive, not only in terms of development effort but also in terms of underlying cloud resource consumption. For a POC, no one needs multi-region redundancy, data encryption, and network security. What we need is to quickly explore the core application and domain logic to validate whether we have the right solution for the problem at hand. However, this POC version can only take the project so far. Nobody will approve its deployment to a production environment. And the last thing we want is to rewrite the service just to add production-hardening features.

The same logic applies to porting a service from one cloud platform to another. Production environment adjustments must be treated similarly to what happens with normal compilers: changing the target platform and/or optimization modes occurs through compilation switches, not code changes. In our case, the handling of environment adjustments will be split between service packaging and deployment.

Industry Landscape

When analyzing available Service Template Generator solutions for AWS, it is easier to start with those that come from AWS itself and then turn to those provided by 3rd-party open source or commercial projects. Without claiming to be complete, here is the picture we have assembled so far:

Fig 2: AWS Serverless Ecosystem

There are quite a few 3rd-party libraries and tools that aim to provide either more convenient or more portable, serverless cloud application development support:

Fig 3: 3rd-Party Serverless Frameworks

Detailed analysis of the pros and cons of every solution would be a fascinating yet lengthy competitive market analysis; this might be a subject for another paper. Here, it is sufficient to claim that both AWS and 3rd-party solutions can be split into 2 categories, based on feature/benefit:

Make it easier to write AWS CloudFormation templates by either providing YAML macros (e.g., SAM) or using a conventional programming language (e.g., CDK). If we treat the AWS CloudFormation template structure as a machine-level language, these solutions could be categorized as macro-assemblers: they make things a bit easier without increasing the level of abstraction.
Generate AWS CloudFormation templates from a programming language (e.g., AWS Chalice or Zappa) without providing a complete solution, leaving substantial boilerplate configuration to be prepared using the same low-level YAML or JSON. These tools also tend to leave too many communication protocol details to be encoded in the form of, say, Python function decorators.

In contrast, with the CAIOS Service Template Compiler, we want to generate the underlying cloud platform configuration file(s) entirely from pure high-level programming language code (e.g., Python), with minimal, if any, communication protocol detail mentioned.

Service Template Model

While developing the CAIOS Service Template Model, we evaluated two basic architectural choices:

Automatic conversion based on a naming convention inspired by the Ruby on Rails’ “convention over configuration” doctrine; and
Python class and method decorators, continuing common practice

We chose the first option of using a naming convention since it delivers clear API call semantics while leaving open the possibility for enforcing industry best practices, security and cost control policies.

The second question was whether we should treat a service as a:

Python class; or
Python package (presumably containing multiple Lambda functions)

Each option has advantages and disadvantages. After experimentation, we decided to adopt the first option — defining every service as a Python class — for the following reasons:

It more naturally reflects the service deploy/shut down life cycle through service class instance (objects)
It more naturally reflects service template/service instance relationships through service class/service class instances (objects)
Internally, the class still needs to be converted into a structure of Python packages/modules per Lambda function, which covers the second option as well; that option could be explicitly supported should the need arise
A Python class more naturally reflects service internal resources (e.g. storage) allocations and access
It more naturally reflects service parameters and external dependencies

We, therefore, will treat the last class definition in the service specification module as the service class and will treat each of its methods as a computation unit specification using the following naming convention:

__init__(self, …): service initialization with external parameters
_<function_name>(self, …): simple Lambda function to be invoked from a StepFunctions workflow, resource event trigger or directly (for testing purposes)
on_<function_name>(self, url, connection_id, …): Lambda function serving an external WebSockets API call
<verb>_<entity_name>(self, …): Lambda function serving an external REST API call (see comments on HTTP API naming convention below)
async _<function_name>(self, …): internal StepFunctions workflow to be started by some internal API call, event trigger or externally (for testing purposes); could be specified as runforever
async on_<function_name>(self, …): StepFunctions workflow to be automatically invoked by external WebSockets API call
async <verb>_<entity>(self, …): StepFunctions workflow to be automatically invoked by external REST API call
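To make the convention concrete, here is a minimal sketch of what a service specification class might look like. The class name PeopleService, its parameters, and all method names are hypothetical and chosen only to exercise each naming pattern; the in-memory dictionary is a local stand-in for a real storage resource and would not survive between separate cloud function invocations.

```python
import asyncio


class PeopleService:
    """Hypothetical service specification following the naming convention."""

    def __init__(self, greeting="Hello"):
        # __init__: service initialization with external parameters.
        self._greeting = greeting
        self._people = {}  # local stand-in for a real storage resource

    def _full_name(self, first, last):
        # _<function_name>: simple internal function.
        return f"{first} {last}"

    def register_person(self, person_id, first, last):
        # <verb>_<entity_name>: would serve an external REST API call.
        self._people[person_id] = self._full_name(first, last)
        return person_id

    def get_person(self, person_id):
        # <verb>_<entity_name>: would serve an external REST API call.
        return self._people[person_id]

    def on_greet(self, url, connection_id, person_id):
        # on_<function_name>: would serve an external WebSockets API call.
        return f"{self._greeting}, {self._people[person_id]}!"

    async def _welcome(self, first, last):
        # async _<function_name>: would become an internal workflow.
        name = self._full_name(first, last)
        await asyncio.sleep(0)
        return f"{self._greeting}, {name}!"


# Run locally as an ordinary Python class, without any cloud resources.
demo = PeopleService()
demo.register_person("p1", "Ada", "Lovelace")
print(demo.get_person("p1"))
print(asyncio.run(demo._welcome("Ada", "Lovelace")))
```

Note that nothing in the class mentions HTTP, WebSockets, Lambda, or StepFunctions; under the convention, that information is carried entirely by the method names and the async keyword.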

For the HTTP API, we are trying to stick with the REST API naming convention using some common sense-based extended vocabulary. Following the REST API guidelines, we can implement resource creation via the HTTP POST method, resource retrieval via the HTTP GET method, etc. To make Python code look more natural, we allow some flexibility in verb naming such that it would be possible to define register_person (automatically converted to HTTP POST /people) and unregister_person (automatically converted to HTTP DELETE /people/{person_id}) functions rather than the more intimidating create_person and delete_person. A more detailed description of the HTTP API naming convention and translation process will be presented in a separate article.
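As a rough illustration of how such a translation could work, the sketch below maps a function name to an HTTP method and path. The verb vocabulary, the pluralization table, and the function name to_http_route are assumptions made for this example; the actual CAIOS vocabulary and parser are not described in this article.

```python
# Hypothetical verb vocabulary; the real CAIOS vocabulary is not given here.
VERB_TO_METHOD = {
    "create": "POST",
    "register": "POST",
    "get": "GET",
    "update": "PUT",
    "delete": "DELETE",
    "unregister": "DELETE",
}

# Methods that address one existing resource and therefore need its id.
ITEM_METHODS = {"GET", "PUT", "DELETE"}

# Tiny irregular plural table, just enough for the example in the text.
IRREGULAR_PLURALS = {"person": "people"}


def pluralize(noun):
    """Return a naive plural form of an entity name."""
    return IRREGULAR_PLURALS.get(noun, noun + "s")


def to_http_route(function_name):
    """Translate '<verb>_<entity>' into an (HTTP method, path) pair.

    Raises KeyError for a verb outside the assumed vocabulary.
    """
    verb, _, entity = function_name.partition("_")
    method = VERB_TO_METHOD[verb]
    path = "/" + pluralize(entity)
    if method in ITEM_METHODS:
        path += "/{" + entity + "_id}"
    return method, path


print(to_http_route("register_person"))    # ('POST', '/people')
print(to_http_route("unregister_person"))  # ('DELETE', '/people/{person_id}')
```

A real implementation would need a richer vocabulary, collection-level retrieval, and proper pluralization; the point here is only that the route can be derived from the name alone.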

Productivity Benefits and Focus Shift

The proposed service model provides an order-of-magnitude productivity boost. What previously took days and weeks, especially for people without a strong background in cloud technologies, now will take hours or minutes (we measured it). This is especially true for the CRUD-like REST services, where an initial code skeleton could be generated automatically from just a list of entities (or company-wide template).

Now, the main challenge and focus will shift from infrastructure heavy-lifting to proper domain modeling. What kind of entities do we want to reflect in the system? What are their relationships? What kind of operations does the service have to support? These questions will never be easy to answer and no automation will help. We can only hope to eliminate all infrastructure scaffolding concerns, making the main problem clearly visible and securing the attention it actually deserves.

Internal Function Invocations

There is no free lunch… at least not in software. The ease of use and high productivity potential of the CAIOS Service Template Model creates the (partial) illusion that the service class is a normal Python class. In some senses, it is. For example, the same service code could be run locally for testing purposes prior to uploading it to the cloud. That saves a lot of time in the initial stages of development.

However, one needs to keep in mind that service class methods are going to be converted into Lambda functions and StepFunctions StateMachines. The most important implication is that when one service class method calls another, it means either invoking a Lambda function or starting a StepFunctions StateMachine execution. The service class would seldom be the right place for keeping common code to be invoked from multiple places. For non-trivial domain services, it would be more appropriate to extract this common code into a separate conventional Python class or package. But even here, one needs to keep in mind that object properties, unless they encapsulate some storage or messaging resource, are not shareable between Lambda functions. In future versions, we may consider relaxing some of these restrictions, but for now, they need to be taken into account. That initial training process, however, does not take too much time, and newcomers usually ramp up in a couple of days.
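The restriction on object properties can be illustrated with a small local simulation. This is a deliberate simplification and not how CAIOS or AWS Lambda actually dispatches calls: the hypothetical helper invoke_like_cloud simply creates a fresh instance per invocation to mimic the fact that separate cloud functions do not share in-memory state. In practice a warm execution environment may retain memory between invocations of the same function, but that must not be relied upon.

```python
class CounterService:
    """Hypothetical service keeping state in a plain object property."""

    def __init__(self):
        self.calls = 0  # lives only in the memory of one process

    def _increment(self):
        self.calls += 1
        return self.calls


def invoke_like_cloud(service_class, method_name, *args):
    """Simulate an isolated invocation: a fresh instance for every call."""
    instance = service_class()
    return getattr(instance, method_name)(*args)


# Local run: one object, so the property accumulates across calls.
local = CounterService()
local._increment()
local_result = local._increment()

# Simulated cloud run: every invocation starts from a clean instance.
cloud_results = [
    invoke_like_cloud(CounterService, "_increment") for _ in range(2)
]

print(local_result)   # 2
print(cloud_results)  # [1, 1]
```

To share the counter between invocations, the property would have to encapsulate a storage or messaging resource, as the text notes.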

Workflow Programming

This is a big topic that deserves a separate discussion. Here, we will touch on only the most important aspects. Many existing solutions advocate specification of workflows in the form of a graph using some static configuration language such as JSON or YAML. For example, the Amazon States Language used for AWS StepFunctions StateMachines is JSON-based (in SAM, it can be specified using simplified YAML). While it might be a valid approach from a workflow engine vendor perspective, we consider that kind of specification to be low-level machine language (of the cloud computer) and strive to compile regular programming language code into it automatically.

Ironically, what happens next is that many vendors provide programming language wrappers for building such JSON/YAML configuration files. For example, the AWS Step Functions Data Science SDK does it for AWS StepFunctions and SageMaker, while Apache Airflow uses Python functions for building workflow graphs. We do not think this approach offers any real improvement and prefer expressing workflows in plain Python code.

CAIOS Service Template Compiler supports all Python control flow structures, waiting (sleep), and parallel computing; and automatically converts them into corresponding AWS StepFunctions states. This is achieved by mapping Python Abstract Syntax Tree nodes into semantically corresponding StepFunctions States, as illustrated below:

a = b: Pass State
self._<function_name>(…): Task State
return …: Pass State
if …/elif …/else …: Choice State
x = … if … else …: Choice State
while …: Choice State
break: Pass State
continue: Pass State
sleep: Wait State
gather: Parallel State

and more to follow.
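The mapping listed above can be sketched with the standard ast module. The function state_type below is a hypothetical, heavily simplified classifier that only reports which state type a single statement would correspond to; it does not generate any state machine definition, and the actual CAIOS compiler is certainly more involved.

```python
import ast


def state_type(node):
    """Return the StepFunctions state type for one statement node."""
    if isinstance(node, (ast.If, ast.While)):
        return "Choice"
    if isinstance(node, (ast.Break, ast.Continue, ast.Return)):
        return "Pass"
    # Unwrap `x = ...`, bare expression statements and `await ...`.
    value = node.value if isinstance(node, (ast.Assign, ast.Expr)) else None
    if isinstance(value, ast.Await):
        value = value.value
    if isinstance(value, ast.IfExp):
        return "Choice"
    if isinstance(value, ast.Call):
        func = value.func
        if isinstance(func, ast.Attribute):
            name = func.attr
        else:
            name = getattr(func, "id", "")
        if name == "sleep":
            return "Wait"
        if name == "gather":
            return "Parallel"
        # self._<function_name>(...) becomes a Task state.
        if (isinstance(func, ast.Attribute)
                and isinstance(func.value, ast.Name)
                and func.value.id == "self"):
            return "Task"
    if isinstance(node, ast.Assign):
        return "Pass"
    return "Unsupported"


SOURCE = '''
async def _workflow(self, order):
    total = order
    checked = self._validate(order)
    label = "big" if checked else "small"
    while checked:
        await sleep(5)
        break
    await gather(self._ship(order), self._bill(order))
    return label
'''

workflow = ast.parse(SOURCE).body[0]
states = [state_type(stmt) for stmt in workflow.body]
loop_states = [state_type(stmt) for stmt in workflow.body[3].body]

print(states)       # ['Pass', 'Task', 'Choice', 'Choice', 'Parallel', 'Pass']
print(loop_states)  # ['Wait', 'Pass']
```

The source is only parsed, never executed, so names such as sleep and gather do not need to be importable for the classification to work.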

Full support of AWS StepFunctions capabilities, such as Express Workflows and SageMaker integration, is planned for future versions.

Python Service Template Compiler Architecture

This section provides a high-level overview of the CAIOS Service Template Compiler (CAIOS STC) architecture, leaving more detailed discussions of specific components to future publications.

Layered Architecture

The CAIOS STC needs to address multiple requirements in terms of cloud-platform portability and seamless extension of the system with support for additional cloud resources (e.g., a new serverless database engine or messaging system). Such a set of requirements is probably best reflected in the Open-Closed Architecture Principle:

it should be possible to add a new capability to the system without changing the implementation of existing ones

When we add a new capability to the system, we need to present the new functionality at an adequate (not too low, not too high) level of abstraction and to provide the correct knobs for achieving the right price/performance and security (among other things, a correct implementation of the “principle of least privilege”). To achieve these goals, we came up with a layered system architecture, as illustrated below:

Fig 4: CAIOS Core Layered Architecture

Here is a brief description of each layer (bottom first):

caios-py-kernel: mandatory part of the system, portable across all cloud platforms
caios-py-kernel-aws: CAIOS “hardware abstraction layer” implementation for a particular cloud platform (in this case, AWS); starting from this layer, it is possible to automatically compile a service class with private cloud functions
caios-py-lib: portable plugins defining and implementing specific interfaces and API protocols (e.g., MutableMapping for DB access and the HTTP API naming convention parser)
caios-py-lib-aws: system plugins “hardware abstraction layer” implementation for a particular cloud platform (in this case, AWS); here, the abstract interfaces and protocols specified above are implemented on top of specific cloud resources (e.g., AWS S3, DynamoDB, API Gateway, StepFunctions, etc.)
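The split between a portable interface and its cloud-specific implementation can be sketched using the MutableMapping example mentioned above. The class InMemoryTable and the function register_person are hypothetical; the in-memory class merely stands in for an implementation that a cloud-specific layer would provide on top of, say, DynamoDB.

```python
from collections.abc import MutableMapping


class InMemoryTable(MutableMapping):
    """Local stand-in for a cloud-backed table (hypothetical name)."""

    def __init__(self):
        self._items = {}

    def __getitem__(self, key):
        return self._items[key]

    def __setitem__(self, key, value):
        self._items[key] = value

    def __delitem__(self, key):
        del self._items[key]

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


def register_person(table: MutableMapping, person_id, name):
    """Domain code depends only on the portable mapping interface."""
    table[person_id] = name
    return person_id


table = InMemoryTable()
register_person(table, "p1", "Ada Lovelace")
print(table["p1"], len(table))
```

Because the domain function is written against MutableMapping only, swapping the in-memory table for a cloud-backed one would not require touching it, which is the property the layered architecture aims for.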

CAIOS STC (portable part)

We can now briefly describe what happens when a user asks to package (prepare for cloud deployment) some particular service:

Fig 5: CAIOS STC (portable part)

At a high level, the compilation process consists of 3 steps:

build service specification document
build service target package
calculate digest for every cloud function (to automatically enforce cloud function cold start when something changes)
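The third step can be sketched with the standard hashlib module. Which bytes are actually hashed and which algorithm is used are not stated in this article; SHA-256 over the function source text, and the helper names function_digest and changed_functions, are assumptions made for illustration.

```python
import hashlib


def function_digest(source_code: str) -> str:
    """Return a hex digest identifying one cloud function's source."""
    return hashlib.sha256(source_code.encode("utf-8")).hexdigest()


def changed_functions(old_digests, sources):
    """Compare fresh digests with stored ones.

    Returns the sorted names of new or modified functions together with
    the freshly computed digest table.
    """
    new_digests = {
        name: function_digest(code) for name, code in sources.items()
    }
    changed = sorted(
        name for name, digest in new_digests.items()
        if old_digests.get(name) != digest
    )
    return changed, new_digests


previous = {"f": function_digest("def f(): return 1")}
current_sources = {"f": "def f(): return 2", "g": "def g(): pass"}
changed, digests = changed_functions(previous, current_sources)
print(changed)  # ['f', 'g']
```

Only the functions reported as changed would need to be forced into a cold start; untouched functions keep their digest and can stay warm.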

Building a service specification document is the most involved and complex step, which could be roughly classified as a Python Abstract Syntax Tree rewrite. Here, we need to parse the original service module, identify the service class (by convention, the last class in the module body), deal properly with module imports and globals, and convert every service class function into a separate cloud function.
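The first part of that process, locating the service class and classifying its methods, can be sketched with the standard ast module. The shape of the resulting document and the names build_spec and classify are hypothetical; the real specification document also has to handle imports, globals, and code extraction, which this sketch ignores.

```python
import ast


def classify(func):
    """Classify one method by the naming convention (simplified)."""
    if func.name == "__init__":
        return "init"
    if func.name.startswith("on_"):
        trigger = "websocket"
    elif func.name.startswith("_"):
        trigger = "internal"
    else:
        trigger = "rest"
    is_workflow = isinstance(func, ast.AsyncFunctionDef)
    return f"{trigger}-{'workflow' if is_workflow else 'function'}"


def build_spec(module_source):
    """Find the last class in the module and describe its methods."""
    tree = ast.parse(module_source)
    classes = [n for n in tree.body if isinstance(n, ast.ClassDef)]
    if not classes:
        raise ValueError("service module contains no class definition")
    service = classes[-1]  # by convention, the last class is the service
    functions = {
        node.name: classify(node)
        for node in service.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    return {"service": service.name, "functions": functions}


MODULE = '''
import json

class Helper:
    def normalize(self, name):
        return name.strip()

class PeopleService:
    def __init__(self, table_name):
        self.table_name = table_name

    def _normalize(self, name):
        return name.strip().title()

    def register_person(self, name):
        return self._normalize(name)

    def on_ping(self, url, connection_id):
        return "pong"

    async def _onboard(self, name):
        return self.register_person(name)
'''

spec = build_spec(MODULE)
print(spec["service"])  # PeopleService
print(spec["functions"])
```

The Helper class is ignored because it is not the last class in the module, matching the convention described above.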

CAIOS STC (cloud-specific part)

We are now ready to convert the cloud-neutral service specification document into a cloud platform-specific deployment script (in this case, an AWS CloudFormation Template) and to generate boilerplate plumbing code for each cloud function. This process is illustrated below:

Fig 6: CAIOS STC (AWS-specific part)

Here, we implement a complete packaging process for a particular cloud platform (in this case, AWS):

First, the service class compilation process outlined above is initiated.
Second, the service configuration is retrieved; if no service-specific configuration is provided, reasonable defaults are used. Here, all service resources such as database, messaging, API, security, import system, logging, tracing, etc. can be specified at the required level of detail for each deployment mode (e.g., dev, test, stage, prod).
Third, an API-specific scan of service functions is performed to identify which type of API Gateway, if any, is required for that service.
Fourth and last, the service (e.g., CloudFormation Stack) template is generated; here, all necessary parts such as template parameters, conditions, resources per cloud function, common resources and outputs are generated in the correct order.
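The last three steps can be sketched as follows. The configuration keys, default values, handler and S3 key layout, and the names build_template, needs_http_api, and logical_id are all hypothetical; only the CloudFormation section names and the AWS::Lambda::Function and AWS::ApiGatewayV2::Api resource types are standard. A real template would need considerably more (IAM roles, integrations, routes, permissions).

```python
import json

# Hypothetical defaults used when no service-specific configuration exists.
DEFAULT_CONFIG = {"runtime": "python3.12", "memory_mb": 128, "timeout_s": 30}


def needs_http_api(function_names):
    """API scan: any public, non-WebSockets function implies an HTTP API."""
    return any(not n.startswith(("_", "on_")) for n in function_names)


def logical_id(function_name):
    """Build an alphanumeric CloudFormation logical id from a name."""
    parts = [p.capitalize() for p in function_name.split("_") if p]
    return "".join(parts) + "Function"


def build_template(service_name, function_names, config=None):
    """Generate a minimal CloudFormation template as a Python dict."""
    cfg = {**DEFAULT_CONFIG, **(config or {})}
    resources = {}
    for name in function_names:
        resources[logical_id(name)] = {
            "Type": "AWS::Lambda::Function",
            "Properties": {
                "Handler": f"{name}.handler",  # assumed module layout
                "Runtime": cfg["runtime"],
                "MemorySize": cfg["memory_mb"],
                "Timeout": cfg["timeout_s"],
                "Role": {"Ref": "ExecutionRoleArn"},
                "Code": {
                    "S3Bucket": {"Ref": "CodeBucket"},
                    "S3Key": f"{service_name}/{name}.zip",  # assumed layout
                },
            },
        }
    if needs_http_api(function_names):
        resources["HttpApi"] = {
            "Type": "AWS::ApiGatewayV2::Api",
            "Properties": {"Name": service_name, "ProtocolType": "HTTP"},
        }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            "Stage": {"Type": "String", "Default": "dev"},
            "CodeBucket": {"Type": "String"},
            "ExecutionRoleArn": {"Type": "String"},
        },
        "Conditions": {"IsProd": {"Fn::Equals": [{"Ref": "Stage"}, "prod"]}},
        "Resources": resources,
        "Outputs": {"ServiceName": {"Value": service_name}},
    }


template = build_template("people", ["_normalize", "register_person"])
print(json.dumps(template, indent=2))
```

Passing a different config dictionary (for example, a larger memory_mb for prod) changes the generated template without touching the service code, in the spirit of compilation switches.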

What’s Next?

Once we have the CAIOS Kernel running on a particular cloud platform, we can start adding 3rd-party and project libraries. Most importantly, every new component can be fully tested locally and on the cloud using the CAIOS STC machinery through simple CAIOS CLI commands. For example, the caios test run command automatically runs unit, local integrated, and remote integrated tests for the service; the remote integrated test run is accompanied by automatic service compilation and upload to the cloud.

Such extensibility was ensured by an open-ended CAIOS kernel namespace architecture, as illustrated below:

Fig 7: CAIOS STC Open Namespace Structure

This structure enables the implementation of different types of cloud functions and resources (e.g. Database) without modification of existing resources. For example, the AWS-specific kernel implementation comes as a relatively straightforward extension of this namespace:

Fig 8: CAIOS STC AWS kernel components
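The extension idea can be approximated with a simple registry. This is a swapped-in technique chosen for brevity: the article does not describe how the open namespace is actually implemented, and the names RESOURCE_TYPES, resource, create, CloudFunction, and Database are hypothetical.

```python
# Registry of resource kinds; new kinds are added without editing old ones.
RESOURCE_TYPES = {}


def resource(kind):
    """Class decorator registering an implementation under a kind name."""
    def decorator(cls):
        RESOURCE_TYPES[kind] = cls
        return cls
    return decorator


@resource("function")
class CloudFunction:
    def describe(self):
        return "plain cloud function"


# Added later, possibly in a separate plugin module; the code above
# does not need to change.
@resource("database")
class Database:
    def describe(self):
        return "key-value database"


def create(kind):
    """Instantiate whichever implementation is registered for a kind."""
    return RESOURCE_TYPES[kind]()


print(sorted(RESOURCE_TYPES))
print(create("database").describe())
```

The registry satisfies the open-closed requirement in miniature: adding the database resource required only new code, not a modification of the existing function resource.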

Conclusion

The CAIOS STC architecture allowed us to achieve a 20-fold productivity gain from the very outset. Moreover, junior developers without prior experience with the AWS cloud were able to start developing REST API services after a couple of days of initial training in the CAIOS development environment and the HTTP API naming convention (people just need to wrap their heads around the automatic conversion of Python functions to HTTP methods).

In this paper, we provided a high-level overview of the project motivation, decision-making factors, and architecture. More detailed discussions of implementing particular interfaces and protocols will come in forthcoming publications. Stay tuned.

Also published on https://python.plainenglish.io/cloud-service-template-compiler-in-python-a78bf28d39e8