AI at Scale: Cloud or Bare Metal

Written by rami3k | Published 2023/07/12
Tech Story Tags: ai | cloud-computing | bare-metal-server | distributed-architecture | artificial-intelligence | software-architecture | future-of-ai | ai-trends

Big Data processing, BI analysis, and AI involve heavy usage of ML, including neural networks. This requires tremendous computational power: hundreds of gigabytes of RAM, tens of CPU cores, as well as graphics cards and/or special chips to speed up calculations.

Problems

  1. When a large heterogeneous team (data analysts and researchers, development engineers, system engineers, etc.) works on a big system, its members use various tools to solve their problems. This applies both to development and runtime environments (Python, Matlab, Java, C/C++, etc.) and to the databases used (RDBMS, NoSQL, files, etc.).

    Retraining staff to use a single, common toolkit can be time-consuming, expensive, and even impossible in the short run.

  2. The introduction of new technologies can make it undesirable or impossible to stick to a universal toolkit. For example, most components of an AI system may be implemented in Java, but a new model may require libraries in Python, C/C++, or other languages. You will have to integrate them: reimplementing these features in the current toolkit could take a long time and be costly for the business.

  3. Vertical scaling may be too expensive or impossible. Of course, it depends on the capacity of your provider’s data centers but, in practice, all of them have their limitations. This sets a limit on boosting computational resources in a single node. You will have to use several nodes grouped into a cluster to achieve the required capacity.

  4. In some cases, when specialized computing hardware (such as NVIDIA Tesla GPUs or accelerators with Tensor Cores) is required, it is reasonable to buy your own servers and install them in your provider’s data centers.

Solutions

Containerization

To solve the problems of toolkit heterogeneity, OCI-compliant container systems are used. A containerized application packages the application together with its runtime environment in a standard image format, so it can run in an isolated environment under a container management system.

Containers can host almost any application- or system-level workload (web, data, and ML applications, DNS servers, network load balancers, etc.). All of them run in isolated environments, which avoids conflicts between system libraries and allows access and resources to be controlled in a flexible way.

Docker is the most popular container platform. It includes both tools for building container images and a daemon that runs and manages containers within a single node.
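
To make this concrete, here is a minimal sketch of building and running a containerized service with the Docker SDK for Python. The ./model-service directory, image tag, port, and memory limit are illustrative assumptions, not part of the original setup:

    import docker

    # Connect to the local Docker daemon (assumes Docker is installed and running).
    client = docker.from_env()

    # Build an image from a hypothetical ./model-service directory
    # that contains a Dockerfile describing the app and its runtime.
    image, _ = client.images.build(path="./model-service", tag="model-service:0.1")

    # Run the container in the background, isolated from other workloads
    # on the node, with an explicit resource limit.
    container = client.containers.run(
        "model-service:0.1",
        detach=True,
        mem_limit="2g",            # cap memory so neighbours are not starved
        ports={"8000/tcp": 8000},  # expose the service port on the host
    )
    print(container.short_id, container.status)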

Orchestration

The core problem Kubernetes solves is managing containers at scale.

Kubernetes is the most well-known and standardized cluster orchestration system. Kubernetes clusters are available from almost any major cloud provider (AWS, Azure, GCP, etc.).

Of course, there are other containerization technologies and container cluster management systems, but Kubernetes orchestrating OCI containers (typically built with Docker) is the most common, de facto standard solution at the moment.
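
To make "managing containers at scale" concrete, here is a hedged sketch using the official Kubernetes Python client to declare a Deployment that keeps three replicas of a containerized service running and reschedules them if a node fails. The image name, labels, and resource requests are illustrative assumptions:

    from kubernetes import client, config

    # Load credentials from ~/.kube/config; inside a pod you would call
    # config.load_incluster_config() instead.
    config.load_kube_config()
    apps = client.AppsV1Api()

    # A Deployment asks the cluster to keep 3 replicas of the container
    # running; Kubernetes restarts or reschedules pods as nodes come and go.
    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name="model-service"),
        spec=client.V1DeploymentSpec(
            replicas=3,
            selector=client.V1LabelSelector(match_labels={"app": "model-service"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "model-service"}),
                spec=client.V1PodSpec(
                    containers=[
                        client.V1Container(
                            name="model-service",
                            image="model-service:0.1",
                            resources=client.V1ResourceRequirements(
                                requests={"cpu": "2", "memory": "4Gi"},
                            ),
                        )
                    ]
                ),
            ),
        ),
    )

    apps.create_namespaced_deployment(namespace="default", body=deployment)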

Let’s start with the considerations to take into account when choosing a Kubernetes solution, and then proceed to the specifics of AI system architecture in a Kubernetes cluster. As described above, the main options are a managed cluster in the cloud and your own cluster on bare metal. Let’s take a closer look at these cases.

Cloud

Cloud providers offer scalability, ease of use, and high availability through their robust infrastructure and redundancy measures. Cloud services provide flexibility, accessibility, and the ability to integrate seamlessly with other applications. Additionally, cloud providers often have strong security measures and disaster recovery solutions in place, providing reliability and data protection.

Bare Metal

Bare metal infrastructure provides dedicated hardware resources and offers more control and customization options compared to cloud computing. It is well-suited for workloads that require high performance, low latency, and specific hardware configurations. Furthermore, bare metal infrastructure allows for greater freedom in choosing and implementing specific platform-as-a-service (PaaS) and software-as-a-service (SaaS) solutions, without being limited to proprietary offerings often provided by cloud providers.

Comparison

Cluster in the cloud

Pros

  • Maintenance is performed by the cloud provider. You don’t need a separate cluster administrator; most of the time, developers can manage the cluster themselves.
  • Databases and other infrastructure with automatic scaling (S3, DynamoDB, RDS, Redshift, etc.) are available.
  • The number of cluster resources can be changed quickly if necessary.
  • Malfunctions are less frequent.

Cons

  • Expensive (please see the comparison below).
  • Higher probability of vendor lock-in when the provider’s technologies are used (e.g., DynamoDB).
  • Malfunctions (when they occur) take longer to fix.
  • A limited choice of suitable hardware, with no option to use custom equipment.
  • In many cases, no transparent pricing plans.

Comments

Of course, it is easier to use a cloud cluster than to deploy your own, at first. But when you run into trouble, finding the cause and a possible solution can take much longer and be more painful: you have limited access to the system level, which makes diagnosing and fixing problems harder. Some problems you will not be able to solve by yourself at all, so you will have to wait for help from tech support.

Be careful when using a cloud provider's proprietary services: they lead to vendor lock-in. If at some point you become dissatisfied with your cloud provider, it will be harder to leave because you already have solutions built on top of those products (e.g., proprietary NoSQL storage).

Cluster on bare metal

Pros

  • Renting bare metal servers from hosting providers is several times cheaper than renting nodes from cloud providers (please see an example of price comparison below).
  • Greater flexibility in selecting, configuring, and installing equipment, including colocation: you can assemble your own servers and install them in a data center. This is relevant if you use specialized hardware (graphics cards, tensor-core accelerators, hardware encryption, etc.).
  • Troubleshooting often goes faster as there are no hard vendor lock-ins, which allows for rapid migration and provides a wide choice of solutions.
  • Greater flexibility in the choice and configuration of systems and services.

Cons

  • You need a dedicated expert or even an entire administration team for cluster maintenance, including deployment, monitoring, and backup.
  • Malfunctions are more frequent early on, depending on the skills of your expert.
  • Scaling requires more time. It may include adding servers, changing the configuration, or migrating to other racks and/or data centers.

Comments

When you choose hardware, give preference to dedicated servers. With virtual servers, you may notice periodic drops in I/O or CPU performance, which affects the speed and stability of your calculations.

You should also pay attention to how your server network is organized. Keeping servers in the same data center works fine, but putting them on the same rack and switch is far better: a fast, stable connection between your servers prevents a lot of problems. Bare metal providers offer this placement as part of their colocation service.

Deployment Architecture

As you know, a Kubernetes cluster consists of a control plane and a data plane. Let’s focus on the data plane nodes:

  1. Data application nodes collect data for ML applications. Master Data is the storage where data application pods upload prepared data. If the entire dataset fits into local node storage, it is recommended to fully replicate the data to all nodes: this significantly reduces failures and network load when ML application pods read the data. ClickHouse and/or MinIO can be used as Master Data (see the first sketch after this list).

  2. You can use colocation and connect an entire rack to a single fast 10GE switch. However, even in this case, it is important to group the prepared data properly. You will then be able to use one of the binary storage formats and avoid the overhead of scraping the data again.

  3. Models and related functions run on ML application nodes. They perform inference and record the results. Result storage nodes must be able to ingest data analysis results, evaluate them, and provide external access to them. You can use ClickHouse and/or MinIO as Master Data here as well (see the second sketch after this list).

  4. You can write your own scheduler using the Kubernetes API, or use the built-in one, Dagster, Airflow, or other schedulers. During implementation, it is important to control the number of pod records in etcd. Completed application pods need to be cleaned up now and then; otherwise, etcd will become overfilled and the cluster will die (see the third sketch after this list).

  5. Share the load across multiple nodes. When ML applications run, it is typical for resources to be used to their fullest, so it is important to always keep free resources for system services. An exception is dev testing environments: they can be placed on a single node to save costs.
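
For item 1 (and the binary format from item 2), here is a hedged sketch of a data application pod uploading a prepared dataset in a binary columnar format (Parquet) to MinIO acting as Master Data. The endpoint, credentials, and bucket name are placeholders, and pandas needs a Parquet engine such as pyarrow installed:

    import pandas as pd
    from minio import Minio

    # Connect to the MinIO endpoint acting as Master Data
    # (host and credentials here are illustrative placeholders).
    master_data = Minio(
        "minio.cluster.local:9000",
        access_key="data-app",
        secret_key="change-me",
        secure=False,
    )

    # Store the prepared dataset in a binary columnar format (Parquet),
    # so ML application pods can read it without re-parsing raw data.
    df = pd.DataFrame({"feature": [0.1, 0.2], "label": [0, 1]})
    df.to_parquet("/tmp/prepared.parquet")

    if not master_data.bucket_exists("master-data"):
        master_data.make_bucket("master-data")
    master_data.fput_object("master-data", "datasets/prepared.parquet",
                            "/tmp/prepared.parquet")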
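
For item 3, a sketch of an ML application pod recording inference results in ClickHouse so that result storage nodes can evaluate them and serve them externally. The host, database, table, and schema are assumptions for illustration:

    from datetime import datetime
    from clickhouse_driver import Client

    # Connect to a ClickHouse node in the result storage layer
    # (host, database, and table are illustrative placeholders).
    ch = Client(host="clickhouse.cluster.local")

    ch.execute("CREATE DATABASE IF NOT EXISTS results")
    ch.execute("""
        CREATE TABLE IF NOT EXISTS results.inference (
            ts DateTime,
            model String,
            object_id UInt64,
            score Float64
        ) ENGINE = MergeTree ORDER BY (model, ts)
    """)

    # Record a batch of inference results; clickhouse-driver sends the
    # rows after the VALUES clause as a bulk insert.
    ch.execute(
        "INSERT INTO results.inference (ts, model, object_id, score) VALUES",
        [(datetime.utcnow(), "model-service:0.1", 42, 0.93)],
    )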
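
And for item 4, a minimal sketch of the periodic cleanup a custom scheduler might perform with the Kubernetes Python client, deleting completed pods so their records do not pile up in etcd. The namespace is an assumed placeholder:

    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()

    # Find pods that have finished (Succeeded or Failed) and delete them,
    # so their records do not accumulate in etcd and overfill it.
    pods = core.list_namespaced_pod(namespace="ml-jobs")
    for pod in pods.items:
        if pod.status.phase in ("Succeeded", "Failed"):
            core.delete_namespaced_pod(
                name=pod.metadata.name,
                namespace="ml-jobs",
            )

In practice, you would run such a cleanup on a schedule (for example, as a Kubernetes CronJob) alongside the scheduler itself.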

Backup and monitoring subsystems should live in a separate space that does not depend on your data and ML subsystems. It is important to create sets of backups at different points in time and to test them during scheduled drills, so you are prepared for sudden failures.

Last but not least are your master nodes, which host the Kubernetes control plane.

Conclusions

If your AI system is in the development phase and does not require constant resource utilization, opting for a cloud provider is a sensible choice. By renting spot and on-demand instances for specific tasks, you can avoid paying for idle resources and potentially save money. However, for AI systems that require continuous cluster utilization and have specific hardware needs, employing a bare metal provider becomes a logical consideration, provided your team has the required expertise.

In terms of disaster recovery and security, cloud providers typically offer robust built-in measures such as data replication, geographic redundancy, and automated backup and recovery solutions. This provides a level of protection against data loss and ensures business continuity in case of unforeseen events. On the other hand, bare metal infrastructure allows for more control over security measures, enabling organizations to implement their own customized security protocols and maintain direct control over data privacy and compliance. However, this also means that the responsibility for implementing and managing disaster recovery and security measures falls solely on your internal SRE team.

