An Introduction to Serverless Cloud Import Systems

Written by asher-sterkin | Published 2020/11/14
Tech Story Tags: pyhton | serverless | cloud | cloud-computing | cloud-native | cloudservices | lambda | python

Python Cloud Importer was developed as a part of the Cloud AI Operating System (CAIOS) project, described at a high level in a separate article. Here, we provide a more detailed account of the motivation and internal design of this part of the system. This paper covers a Serverless Python Importer in detail; the general principles presented, however, apply equally to any programming language. To understand why we decided to develop it and why Cloud Importer plays such a crucial role in CAIOS, we first need to cover the basics.

Industry Landscape

The current state of affairs in package management is a messy patchwork accumulated over decades, leading to a lot of confusion and inconsistencies. In this section, we provide a general overview with a particular focus on, but not limited to, the Python ecosystem.
Much of the confusion comes from the fact that package management exists at multiple levels and means different things at each of them, multiplied by the specifics of each programming language and/or run-time environment. Complete coverage would probably take a full book. In this paper, we just briefly mention these levels, hopefully enough to get a sense of the complete picture:
  • Operating System (Linux) - normally distributed in the form of ISO image files
  • Installation Packages - every Operating System has its own installation package format and repository structure (e.g. Ubuntu uses Debian packages, while CentOS-based Linux flavors use RPM)
  • Python - one of the messiest and hardest to grasp in its full form
  • Docker images - add another dimension of flexibility and complexity
  • Cloud Environment - it is not obvious to everybody that a cloud environment itself provides additional ways of deploying and accessing software. For example, AWS Lambda Layers, AWS Lambda Functions, and AWS CloudFormation each provide an additional way of packaging and deploying software in the cloud

How did it all start?

In the previous article, we laid out some basic principles: treating each individual region of one account of one vendor as a separate Cloud Computer, and setting the strategic goal of providing an adequate development environment, suitable for ordinary people, in order to unleash the true power of the Serverless technology disruption. And then, our next question was: “how do we start?”
We need to somehow get access to 3rd party Open Source libraries and tools, and we need someplace to put our own stuff. What shall we do?
Initially, we tried to follow a more traditional route of working with Python venv and using Docker to correctly build zip files for Lambda Layers and Lambda Functions. To a certain degree, it worked, but it was slow and cumbersome, and “democratic” would be the last word to describe that process.
Then, we started exploring all kinds of proposed solutions for the Serverless Python repository. All of them looked more like quick lift-and-shifts than a fundamental revision of the whole approach. We decided to dig deeper.
And then a moment of epiphany came. What do we really need except for Cloud Storage? Why can't we just import the needed libraries from there and put our stuff there as well? No pip install to be run by everybody, no Docker, no zips. Plain and simple. Yes, somebody will need to put 3rd party libraries on Cloud Storage, but this is a one-time activity. After that, everybody else will just need to write import xyz and that’s it. Conceptually, this is similar to the Anaconda Distributions but without the need to actually download and install anything.

CAIOS Modules Structure

The CAIOS modules structure is illustrated below:
Fig 1: CAIOS Modules
At the bottom, there is a caios-core layer responsible for converting high-level programming language code, in this case Python, into a low-level cloud service specification. Among other modules, it includes a Python Importer, described below.
Above caios-core, there is another layer, caios-library, implementing a kind of Cloud Shelf with the “put by one, use by everybody” operational concept. Here, we can upload as many libraries, binary executables and data files as needed and make them seamlessly available to cloud functions (aka AWS Lambda) via the standard Python import mechanism.
Other modules are less essential for this article and will be covered elsewhere.

CAIOS Cloud Importer for Python

The Python importlib.abc Finder class hierarchy has the following structure:
Fig 2: Python importlib.abc Finder class hierarchy
The Loader class hierarchy has the following structure:
Fig 3: Python importlib.abc Loader Class Hierarchy
Nice and simple, eh? It took some time to crack this nut, but eventually, we were able to develop a custom Cloud Importer implementation.
CAIOS custom Python Cloud Finder design is presented below:
Fig 4: CAIOS S3ObjectFinder
CAIOS Python Cloud Loader design is presented below:
Fig 5: CAIOS S3ObjectLoader
In this implementation, for every get_spec(self, fullname, target) call, the S3ObjectFinder obtains a list of keys from the Bucket (we use an internal cache for speedup), scans through the list, and tries to find the best matching key, be it a plain Python module, a Python extension, a package, or a namespace. Based on the match, for real Python modules, it returns a Spec pointing to either S3SourcelessObjectLoader or S3ExtensionObjectLoader. We decided not to support the source object loader for security reasons.
Implementing the S3SourcelessObjectLoader was relatively straightforward: download the object from S3 and pass the byte stream to the parent SourcelessLoader class.
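The finder/loader split described above can be illustrated with a minimal sketch. It is not the CAIOS code: a plain dictionary (FAKE_BUCKET) stands in for the S3 Bucket, the class names DictFinder and DictSourcelessLoader and the ".bin" key suffix are hypothetical, and the finder plugs into the standard find_spec hook of importlib.abc.MetaPathFinder. The loader executes pre-compiled bytecode only, in the spirit of a sourceless loader.

```python
import importlib.abc
import importlib.util
import marshal
import sys

# Hypothetical stand-in for an S3 bucket: object key -> marshalled code object.
# A real implementation would download these bytes from cloud storage instead.
FAKE_BUCKET = {
    "hello_cloud.bin": marshal.dumps(
        compile("def greet(name):\n    return 'Hello, ' + name\n",
                "<cloud:hello_cloud>", "exec")
    ),
}


class DictSourcelessLoader(importlib.abc.Loader):
    """Executes pre-compiled bytecode fetched by key; no source is involved."""

    def __init__(self, bucket, key):
        self.bucket = bucket
        self.key = key

    def create_module(self, spec):
        return None  # use the default module creation

    def exec_module(self, module):
        code = marshal.loads(self.bucket[self.key])
        exec(code, module.__dict__)


class DictFinder(importlib.abc.MetaPathFinder):
    """Maps a module's full name to an object key and returns a matching spec."""

    def __init__(self, bucket):
        self.bucket = bucket

    def find_spec(self, fullname, path=None, target=None):
        key = fullname.replace(".", "/") + ".bin"
        if key not in self.bucket:
            return None  # let the next finder on sys.meta_path try
        loader = DictSourcelessLoader(self.bucket, key)
        return importlib.util.spec_from_loader(fullname, loader, origin=key)


# Register the finder; the standard finders are still consulted first.
sys.meta_path.append(DictFinder(FAKE_BUCKET))

import hello_cloud  # resolved by DictFinder, not by the file system

print(hello_cloud.greet("CAIOS"))  # prints "Hello, CAIOS"
```

Packages, namespaces and key caching are left out of the sketch. Note also that marshalled bytecode is specific to the interpreter version that produced it, so a real bucket would need objects compiled for the target run-time.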
Implementation of the S3ExtensionObjectLoader was more involved. It is not enough to download the object from an S3 Bucket; you also need to signal somehow to Linux that this is a Linux Shared Object. You also need to check whether this Linux Shared Object depends on other Linux Shared Objects and handle them all accordingly, in topological sort order. We eventually settled on using Python's cdll (from ctypes) and the elftools library to accomplish the task.
The current implementation first downloads the Linux Shared Object files to the /tmp folder, which to some extent brings us back to the 250MB disk space limit (so far, we have not encountered any ML library that requires that much for its shared objects) and incurs some extra latency. We are currently exploring more efficient alternatives.
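The dependency ordering step can be sketched as follows. This is an illustration under stated assumptions rather than the CAIOS implementation: the dependency map NEEDED and the library names are made up (in practice such a map would be read from the shared objects themselves, e.g. with elftools), and the function names load_order and preload are hypothetical. The sketch relies on graphlib, which is part of the standard library from Python 3.9.

```python
import ctypes
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def load_order(needed):
    """Return library names ordered so that every dependency comes first."""
    # TopologicalSorter expects node -> predecessors, which matches
    # "library -> libraries it needs".
    return list(TopologicalSorter(needed).static_order())


# Hypothetical dependency map: shared object -> shared objects it depends on.
NEEDED = {
    "_ext.so": {"libmath.so", "libutil.so"},
    "libmath.so": {"libutil.so"},
    "libutil.so": set(),
}

order = load_order(NEEDED)
print(order)  # ['libutil.so', 'libmath.so', '_ext.so']


def preload(order, directory="/tmp"):
    """Load shared objects in dependency order with globally visible symbols.

    Not called here, because the files above do not exist.
    """
    return [ctypes.CDLL(f"{directory}/{name}", mode=ctypes.RTLD_GLOBAL)
            for name in order]
```

A cyclic dependency map would make static_order() raise graphlib.CycleError, which is a reasonable point to fail the import early.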
The current implementation of S3ExtensionObjectLoader is illustrated below:
Fig 6: CAIOS S3ExtensionObjectLoader implementation
The basic usage of the Cloud Importer is fairly simple:

Patching 3rd Party Open Source Libraries

Once our Cloud Importer started working, we encountered an interesting problem: many libraries use direct file i/o to access internal configuration files or to loop over available plug-ins dynamically. Of course, there is no file system for modules imported from S3, so many of these libraries failed.
A proper solution was found quite quickly: patch these libraries to use more idiomatic Python facilities such as pkgutil.iter_modules(), pkgutil.get_data() and pkg_resources.resource_listdir(). So far, these simple changes have been enough to onboard close to 50 of the most popular Python libraries. We are willing to share our findings with Open Source library authors, should they want them, and to provide further feedback.
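The kind of patch involved can be illustrated with a small before/after sketch. The package demo_pkg, its file defaults.cfg and its module plugin_a are hypothetical and are created in a temporary directory only to keep the example self-contained. Here both approaches read the same bytes because the package does live on disk; the point is that the second form goes through the package's loader and so does not assume a file system.

```python
import importlib
import os
import pkgutil
import sys
import tempfile

# Build a throwaway package on disk (all names here are hypothetical).
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_pkg")
os.mkdir(pkg_dir)
for filename, text in (("__init__.py", ""),
                       ("plugin_a.py", "NAME = 'a'\n"),
                       ("defaults.cfg", "mode=fast\n")):
    with open(os.path.join(pkg_dir, filename), "w") as f:
        f.write(text)

sys.path.insert(0, root)
importlib.invalidate_caches()
import demo_pkg

# Before: direct file i/o, which assumes the package lives on a file system.
cfg_path = os.path.join(os.path.dirname(demo_pkg.__file__), "defaults.cfg")
with open(cfg_path, "rb") as f:
    via_file = f.read()

# After: ask the package's loader for the data instead.
via_loader = pkgutil.get_data("demo_pkg", "defaults.cfg")

# Plug-in discovery without os.listdir(): iterate over the package's modules.
plugins = sorted(info.name for info in pkgutil.iter_modules(demo_pkg.__path__))

print(via_loader == via_file)  # prints True
print(plugins)                 # prints ['plugin_a']
```

For a custom importer, these calls only work if its loader and finder implement the corresponding hooks (for example get_data on the loader).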

CAIOS Cloud Importer Catalog

The S3ObjectFinder solution described above works reasonably well and is good when the list of Python libraries is not stable. In real production, however, this would seldom be the case. Open-Source Python libraries, especially heavy-weight Machine Learning ones, do not change every minute and upgrading to a new version usually requires some careful backward compatibility verification. 
Could we leverage this fact to reduce import latency? The answer is yes. We could build a simple catalog that maps a Python module's full name to its S3 object key, if any, and to the corresponding Loader object in one shot.
The CAIOS Cloud Importer Catalog class hierarchy is presented below:
Fig 7: CAIOS Cloud Importer Catalog Class Hierarchy
Now we have flexibility: if the S3 Bucket contains a catalog file (a simple pickled Python dictionary), we use the CatalogObjectFinder; otherwise, we use the S3ObjectFinder, as illustrated below:
Fig 8: CAIOS Cloud Importer
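The selection logic can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the catalog layout (module full name mapped to an object key and a loader kind), the key name "catalog.pkl", the key suffixes, and the function names. A dictionary again stands in for the S3 Bucket.

```python
import pickle

# Hypothetical suffix rules used when no catalog is available.
SUFFIXES = (("/__init__.pyc", "sourceless"),
            (".pyc", "sourceless"),
            (".so", "extension"))


def scan_lookup(keys):
    """Fallback: derive candidate keys from the module name and scan for them."""
    def lookup(fullname):
        base = fullname.replace(".", "/")
        for suffix, kind in SUFFIXES:
            if base + suffix in keys:
                return base + suffix, kind
        return None
    return lookup


def choose_lookup(bucket, catalog_key="catalog.pkl"):
    """Prefer a pre-built catalog if the bucket has one, else scan the keys."""
    if catalog_key in bucket:
        # Only unpickle catalogs from a trusted bucket.
        entries = pickle.loads(bucket[catalog_key])
        return "catalog", entries.get
    return "scan", scan_lookup(set(bucket))


# Hypothetical catalog: module full name -> (object key, loader kind).
catalog = {"pkg": ("pkg/__init__.pyc", "sourceless"),
           "pkg.fast": ("pkg/fast.so", "extension")}

with_catalog = {"catalog.pkl": pickle.dumps(catalog)}
without_catalog = {"pkg/__init__.pyc": b"", "pkg/fast.so": b""}

for bucket in (with_catalog, without_catalog):
    mode, lookup = choose_lookup(bucket)
    print(mode, lookup("pkg.fast"), lookup("missing"))
```

With the catalog, each lookup is a single dictionary access; without it, every import probes several candidate keys, which is where the latency difference would come from.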

What’s Next?

Having a Cloud Importer for Python opens a lot of new possibilities. First, the same scheme is easily extrapolated to other dynamic run-time environments such as Java, JavaScript or .NET.
Second, there is room for many additional optimizations, such as bundling multiple modules that are imported together anyhow into one, possibly compressed, S3 object.
Third, we may at last start talking seriously about serverless inference and ETL for a much wider scope of practical use cases (so far, these were mostly POC toys).
Fourth, we could come up with a powerful Intellectual Property protection solution.
And last but not least, we may start tracking the actual usage of every Python module and provide valuable feedback to Open Source library authors.
We will report on some of these exciting directions in forthcoming articles. Stay tuned.

Yes, we follow the news

With the recent AWS announcement of support for a shared file system for Lambda Functions, the question of the viability of Cloud Import was immediately raised.
Does EFS support mean that the S3-based Cloud Importer value proposition is not viable anymore and that its development should be discontinued? The short answer is No. Let’s see why.
As argued in the CAIOS project position paper, we treat AWS storage and database services as different types of Cloud Computer memory, each with its own price/volume/performance ratio. We also outlined there that, in the general case, optimizing the service packaging structure is too complex a task for humans and should be done automatically, based on collected operational statistics, data, and Machine Learning models. From that perspective, shared EFS is just another memory cache tier to be included in the automatic or semi-automatic optimization process.

Acknowledgments

The bulk of the code was developed and tested by Anna Veber from BST LABS. Alex Ivlev from BlackSwan Technologies made substantial contributions to Cloud 9 integration with goofys. Etzik Bega from BlackSwan Technologies was the first person on Earth to realize the Cloud Importer's potential as an IP protection solution. Piotr Orzeszek performed the initial benchmark of the CAIOS Cloud Importer with heavyweight Open Source ML libraries and implemented the first version of a Serverless news article classifier.

Written by asher-sterkin | Software technologist/architect. 40 years in the field. Focused on Cloud Serverless Native solutions
Published by HackerNoon on 2020/11/14