Act Like a Cloud Genius — Kill Zombie Capacity Quickly

Written by jahedmomand | Published 2019/08/19
Tech Story Tags: microservices | cloud-computing | kubernetes | software-engineering | software-development | latest-tech-stories | zombie-capacity-in-software | capacity-management

TLDR: Capacity management is the science, and the art, of balancing performance, cost, and resources. You want the best possible performance with the least amount of resources. But given the billing complexities of cloud infrastructure, you also want to manage which billing model you use to maximize your return on investment (ROI). Cost consists of the billing models you use, such as reserved vs. on-demand VMs, and is also characterized by how you group resources. Infrastructure debt accumulates when the team keeps allocating the wrong infrastructure.

Zombie capacity is any infrastructure piece that looks like it’s doing something, but in reality, is lying unused and should be killed. Zombie capacity can accumulate quickly and can be one of your largest infrastructure debts. The zombies come out of the dark when you get your cloud bill, your users complain about your system’s performance or availability, or you look at cost or usage metrics that look shameful to you and are hard to defend to your CTO.
It all comes down to understanding the dark side of running your applications or microservices on top of cloud infrastructure: capacity management.
Capacity Management is the dark side of cloud-native applications that is usually ignored
Capacity Management is the science, and the art, of balancing performance, cost, and resources. You would like to get the best possible performance with the least amount of resources.
But given the billing complexities of cloud infrastructure, you also want to manage which billing model you use to maximize your return on investment (ROI).
The following factors impact capacity management:
  • Application/microservices performance is defined as the user experience and overall system responsiveness. 
  • Resources are CPU, memory, and I/O in the case of VMs, or they can be higher-level PaaS resources, such as managed databases, middleware, etc. In this article, I focus on VMs (IaaS) only. 
  • Cost consists of the billing models you use, such as reserved vs. on-demand VMs. It is also characterized by how you group resources.
For example, do you use eight cores in one VM, or distribute them across four VMs? This grouping can make a huge difference in your bill.
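The grouping trade-off above can be made concrete with a quick back-of-the-envelope calculation. This is a minimal sketch using hypothetical hourly prices; real prices vary by provider, region, and billing model:

```python
# Sketch: comparing two ways to buy the same 8 cores.
# Prices below are illustrative placeholders, not real cloud prices.
PRICE_8_CORE_VM = 0.34   # $/hour for one 8-core VM (assumed)
PRICE_2_CORE_VM = 0.10   # $/hour for one 2-core VM (assumed)

HOURS_PER_MONTH = 730    # average hours in a month

one_big_vm = PRICE_8_CORE_VM * HOURS_PER_MONTH
four_small_vms = 4 * PRICE_2_CORE_VM * HOURS_PER_MONTH

print(f"1 x 8-core VM:  ${one_big_vm:.2f}/month")
print(f"4 x 2-core VMs: ${four_small_vms:.2f}/month")
print(f"difference:     ${abs(one_big_vm - four_small_vms):.2f}/month")
```

With these assumed prices, the same 8 cores cost noticeably more when split across four small VMs; with other instance families the comparison can go the other way, which is exactly why the grouping decision is worth modeling before you commit.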

Why Should I Care?

It depends on who you are and what you do :)
If you are a DevOps or SRE engineer, you want to:
Save yourself some sleepless nights when PagerDuty alerts go off because your users have a terrible user experience. On our platform, we’ve observed that poorly distributed resources and stale scalability rules are behind more than 25% of live-site incidents.
Avoid stressful monthly cloud-bill reviews. A good chunk of engineering time goes to analyzing (and reacting to) cloud bills. You usually get those questions when the bill goes up significantly without clear business justification. For example, if your bill goes up in one month by 30% without adding that many users or features, this is a big deal for the business and leadership in your company.
If you are a developer, you want to:
Learn how to write better cloud-native microservices by correlating your code and changes to the user experience. Know if your deployed microservice is getting better performance for the resources it got or not. For example, is the new feature or recent bug fix consuming too many resources?
Understand how your microservice behaves under real workloads. You want to know if your microservice started to behave unexpectedly at specific workloads or conditions without doing any explicit instrumentation.
If you are an engineering manager, you want to:
Run lean and avoid infrastructure debt. Infrastructure debt is similar to architectural or code debt: you need to make sound infrastructure decisions now to prevent slowdowns later.
This kind of debt accumulates when the team keeps allocating the wrong infrastructure under pressure to move fast. It becomes harder and harder to keep releasing with decent velocity as the demand to run efficient infrastructure grows.

Why is Capacity Management a Pain In The Neck?

Capacity management is a moving target. It is impacted by users’ workloads, changing application/system architecture, and evolving infrastructure. Multiple people and roles impact application and infrastructure architecture, and they work with different, sometimes conflicting, motivations.
Factors impacting capacity management of cloud-native applications and infrastructure
Also, each one of these three factors moves at a different velocity.
Users’ workloads change every few seconds or minutes, which impacts your applications’ performance and infrastructure utilization.
Application architecture changes every few months, if not faster, depending on the team’s velocity. It impacts the user experience and the use of infrastructure capacity.
Infrastructure technologies evolve every few months. This impacts the performance of the application and, eventually, the user experience.
For example, using compute-optimized instances improves the performance of CPU intensive microservices.
Using the right type of disks and network interfaces positively impacts your databases.

What Should I Do?

If you are a DevOps or SRE engineer, you need to focus on the following:
User Experience
Measure, characterize, and link users’ workloads to microservices
  • Characterize workloads by measuring their intensity and latency throughout the day. Figure out if there are hourly, daily, weekly, or seasonal patterns.
  • Quantify these patterns, i.e., the number of API calls for each feature and the variability of the workload.
  • Profile each feature by identifying the impacted microservices and measuring the CPU, memory, and I/O consumed to satisfy each API call (or feature).
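The characterization steps above can be sketched as a simple aggregation over request logs. This is a minimal sketch assuming each log record is a (timestamp, api_name, latency_ms) tuple; the record format and names are illustrative, not from any specific tool:

```python
# Sketch: surfacing hourly workload patterns from request logs.
from collections import defaultdict
from datetime import datetime

def calls_per_hour(records):
    """Count API calls per hour of day to surface daily patterns."""
    buckets = defaultdict(int)
    for ts, api_name, latency_ms in records:
        buckets[ts.hour] += 1
    return dict(buckets)

# Illustrative records: (timestamp, api_name, latency_ms)
records = [
    (datetime(2019, 8, 19, 9, 15), "/search", 42.0),
    (datetime(2019, 8, 19, 9, 40), "/search", 55.0),
    (datetime(2019, 8, 19, 14, 5), "/checkout", 120.0),
]
print(calls_per_hour(records))  # {9: 2, 14: 1}
```

The same bucketing idea extends to days of the week or months of the year for weekly and seasonal patterns, and to per-API latency percentiles for intensity.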
Performance and Profile of Microservices
  • For each microservice, understand whether you are over- or under-budgeting resources. If you don’t have a budget, at least create a baseline from the workloads you measured in the previous step.
  • Profile different microservices by identifying whether they are CPU, memory, or I/O intensive.
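The budget check above boils down to comparing measured usage against allocation for each service. A minimal sketch, assuming you already export allocated and measured (e.g., p95) CPU cores per service; service names and thresholds are illustrative:

```python
# Sketch: flagging over/under-budgeted microservices.
# Thresholds (50% and 90%) and service names are illustrative assumptions.

def budget_verdicts(budget, actual):
    """Compare measured usage against allocation for each service."""
    verdicts = {}
    for svc, allocated in budget.items():
        ratio = actual[svc] / allocated
        if ratio < 0.5:
            verdicts[svc] = "over-budgeted"
        elif ratio > 0.9:
            verdicts[svc] = "near limit"
        else:
            verdicts[svc] = "ok"
    return verdicts

budget = {"search": 8, "checkout": 4, "emails": 4}        # cores allocated
actual = {"search": 7.5, "checkout": 1.1, "emails": 3.0}  # cores used (p95)
print(budget_verdicts(budget, actual))
# {'search': 'near limit', 'checkout': 'over-budgeted', 'emails': 'ok'}
```

The same structure works for memory and I/O; running it per dimension is what tells you whether a service is CPU, memory, or I/O intensive.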
Infrastructure
  • Identify zombie VMs. These are VMs that can be killed and have their current workloads moved to other VMs. Look at the three common dimensions, CPU, memory, and I/O (mainly network), to identify these underutilized VMs.
  • Match services’ profiles to the right VMs.
Running your microservices on a general-purpose VM does not save your day. If your services are compute-intensive, you need to run them on compute-optimized instances, such as the C5 family on AWS. The C5 family will give you much higher performance and scalability value for each dollar you pay to AWS.
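The zombie-VM check is essentially a filter over the three utilization dimensions. A minimal sketch, assuming your monitoring system already exports average CPU, memory, and network utilization (0–1) per VM; the VM names and the 10% threshold are illustrative:

```python
# Sketch: spotting zombie-VM candidates from utilization metrics.

def find_zombies(vms, threshold=0.10):
    """A VM is a zombie candidate if it is underutilized on all three
    dimensions: CPU, memory, and network I/O."""
    return [
        name for name, m in vms.items()
        if m["cpu"] < threshold and m["mem"] < threshold and m["net"] < threshold
    ]

# Illustrative utilization averages (fraction of capacity)
vms = {
    "web-1":    {"cpu": 0.62, "mem": 0.70, "net": 0.40},
    "batch-7":  {"cpu": 0.03, "mem": 0.08, "net": 0.01},  # looks dead
    "legacy-2": {"cpu": 0.05, "mem": 0.04, "net": 0.02},  # looks dead
}
print(find_zombies(vms))  # ['batch-7', 'legacy-2']
```

In practice you would average over a window long enough to cover your workload patterns (at least a week) before killing anything, so a nightly batch job doesn’t get flagged.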
If you are a Developer, you need to focus on:
Create a baseline for your microservices. How much horsepower (CPU, memory, and I/O) does your microservice need to serve a specific unit of workload per second?
For example, how many API requests per second can your microservice serve with 2 CPU cores, 4GB memory, and 10Gbits network? Baseline your microservice at different workloads. Track if this baseline you created changes from one release to another.
A common mistake here is not tracking minor releases. Sometimes a minor release introduces bugs that disrupt that baseline significantly. You want to know about that as soon as it happens.
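Tracking the baseline across releases can be as simple as comparing each release’s throughput, at a fixed resource envelope, against the previous one. A minimal sketch; release names, throughput numbers, and the 10% tolerance are illustrative assumptions:

```python
# Sketch: flagging releases whose baseline throughput regressed.
# Baseline = requests/sec served at a fixed envelope (e.g., 2 cores, 4 GB).

def regressions(baselines, tolerance=0.10):
    """Flag releases whose throughput dropped more than `tolerance`
    relative to the previous release."""
    flagged = []
    releases = list(baselines.items())
    for (prev_rel, prev_rps), (rel, rps) in zip(releases, releases[1:]):
        if rps < prev_rps * (1 - tolerance):
            flagged.append(rel)
    return flagged

baselines = {"v1.4.0": 950, "v1.4.1": 940, "v1.5.0": 760}  # req/s at 2 cores / 4 GB
print(regressions(baselines))  # ['v1.5.0']
```

Wiring a check like this into CI is what catches the minor-release regressions mentioned above before users do.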
If you are an engineering manager, work with your team on these
Figure out the right KPIs (Key Performance Indicators). A single KPI won’t give you the full picture of your capacity. Your team should track at least one KPI per capacity dimension. Here are some examples:
(1) Cost KPIs: cost per user or cost per operation (direct and indirect), or cost per microservice,
(2) Performance KPIs: API latency (90th, 95th, and 100th percentiles across users),
(3) Resources KPIs: cost per CPU, effective CPU cost (utilization included), cost per memory GB.

TLDR

  • Capacity management is the dark side of cloud-native applications that is usually ignored.
  • Capacity management is the science, and art, of balancing performance, cost, and resources.
  • You should care about capacity management because it will save you sleepless nights and difficult questions around the cloud provider bill, level up your cloud-native software development skills, and eliminate cloud infrastructure debt that can accumulate very quickly.
  • Capacity management is a pain due to the many factors impacting it, namely: users’ workloads, changing application/system architecture, and evolving cloud infrastructure.
  • Measure, characterize and link users workloads to microservices.
  • Create a reasonable KPI for each of the factors impacting your capacity management — details below.

Written by jahedmomand | Lowering the K8s Learning Curve at magalix.com
Published by HackerNoon on 2019/08/19