Why You Should Run Multiple Applications on the Same GPU (and Why it's so Difficult)

Written by razrotenberg | Published 2022/01/19

TL;DR: The GPU utilization of a deep-learning model running solely on a GPU is most of the time much less than 100%. The only way for multiple applications to run simultaneously is to cooperate with one another. More advanced GPUs require much more effort to utilize properly. Sharing a GPU by running multiple applications on it can minimize these idle times, utilize this unused GPU memory, and increase the GPU utilization drastically. This can help teams and organizations significantly reduce rental costs, get more out of their already purchased GPUs, and be able to develop, train, and deploy more models using the same hardware.

While GPUs are being used more and more, many people aren't utilizing them properly.
You’d be surprised how often GPUs sit idle, and how easy it is to use them inefficiently.
One way to increase GPU utilization is by running more than a single application on the same GPU.
We will discuss why low GPU utilization is common, why running multiple applications on a GPU can solve this, and why doing so is a real challenge.

Why You Should Share GPUs

GPUs are expensive and are idle too often

Whether you purchase GPUs or rent them in the cloud, they are expensive; and when you pay a lot of money for hardware, you want to know that you are using it well.
Unfortunately, this is not the common case for GPUs and many teams and organizations find their GPUs idle too often.
Let’s discuss the many times that GPUs are idle.

GPUs are idle when people don’t use them.

GPUs are idle during non-working hours. For example, at night or on weekends when no one ran a GPU-consuming application before leaving the office.
GPUs are also idle during working hours. For example, on coffee breaks, or when going to lunch.
While this is pretty intuitive and might even seem funny, it happens a lot, and the truth is that these idle times are not so easy to eliminate.
In addition, GPUs are idle at even smaller time scales, which are a bit harder to grasp and might even be surprising.
GPUs are idle when people use them.
A Jupyter Notebook is a good example. A user working in a Jupyter Notebook usually alternates between writing code, executing it on the GPU, and examining the results, so the GPU is idle for long periods of time during this process. If the user works in a multi-GPU environment, even more GPUs sit idle.
GPUs are idle when applications use them.
Most applications have CPU and I/O work in between launching GPU kernels, so the GPU utilization of a deep-learning model running solely on a GPU is usually well below 100%.
For example, in medical imaging models, each step can have a few minutes of work on the CPU as well. 
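You can see this for yourself by sampling the GPU while a model trains. Below is a minimal sketch (assuming the pynvml bindings are installed; the device index and one-second interval are arbitrary choices for illustration):

```python
# Sample GPU utilization and memory usage once per second and print them.
# Requires the NVIDIA driver and the pynvml bindings (pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0; adjust for your setup

try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU util: {util.gpu:3d}%  |  memory used: {mem.used / 2**30:.1f} GiB")
        time.sleep(1)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```

Watching this output while a typical training script runs is often enough to reveal long stretches where utilization sits near zero.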

Good architecture can improve this by running CPU and GPU tasks in parallel instead of sequentially, but as you might have guessed, this requires some effort.
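As an illustration of what that effort can look like, here is a hedged sketch of one common pattern in PyTorch (the dataset and model are placeholders): batches are preprocessed in background CPU worker processes while the GPU computes, and host-to-device copies are made asynchronous so they can overlap with already-queued GPU work.

```python
# Sketch: overlap CPU-side data loading/preprocessing with GPU compute in PyTorch.
import torch
from torch.utils.data import DataLoader, TensorDataset


def main():
    device = torch.device("cuda")

    # Dummy dataset standing in for real, CPU-heavy preprocessing.
    data = TensorDataset(torch.randn(10_000, 3, 64, 64), torch.randint(0, 10, (10_000,)))

    loader = DataLoader(
        data,
        batch_size=64,
        num_workers=4,    # preprocess upcoming batches in parallel CPU worker processes
        pin_memory=True,  # page-locked host memory enables asynchronous copies to the GPU
    )

    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 10)).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for inputs, targets in loader:
        # non_blocking=True lets the host-to-device copy overlap with queued GPU work.
        inputs = inputs.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)

        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        loss.backward()
        optimizer.step()


if __name__ == "__main__":  # guard needed when DataLoader workers use the spawn start method
    main()
```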

GPUs are too big

GPUs are getting better each year. They are getting faster with more compute power, and come with more memory.
Sometimes, though, more memory is unnecessary.
Experimenting with a new model allows, and sometimes even requires, using smaller hyperparameters (e.g. batch size, image size, etc.), making the model use much less GPU memory than it normally would.
There are also plenty of models that would not use all 32GB of memory of an NVIDIA V100, for example, let alone when running inference.
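A quick way to check how much of a card a given model actually needs is to compare what the framework has allocated against the device's total memory. A minimal PyTorch sketch (the model and batch here are just placeholders):

```python
# Sketch: compare a model's actual GPU memory footprint to the card's capacity.
import torch

device = torch.device("cuda:0")
model = torch.nn.Linear(4096, 4096).to(device)        # placeholder model
out = model(torch.randn(256, 4096, device=device))    # placeholder forward pass

allocated = torch.cuda.memory_allocated(device) / 2**30               # GiB held by tensors
total = torch.cuda.get_device_properties(device).total_memory / 2**30  # GiB on the card

print(f"allocated: {allocated:.2f} GiB of {total:.2f} GiB "
      f"({100 * allocated / total:.1f}% of the GPU)")
```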
Generally speaking, better, larger (in terms of memory), more advanced GPUs require much more effort to utilize properly.
…with great power comes great responsibility
Sharing GPUs by running multiple applications on the same GPU can minimize these idle times, utilize this unused GPU memory and increase the GPU utilization drastically. 
This can help teams and organizations significantly reduce rental costs, get more out of their already purchased GPUs, and be able to develop, train, and deploy more models using the same hardware.
There’s a reason it’s not common, though: it’s hard to do.

Let’s see why.

Sharing a GPU Is Hard

Applications that run on the same GPU share its memory. Every byte allocated by one application leaves one less byte for the other applications to use.
The only way for multiple applications to run simultaneously is to cooperate with one another. Applications should not exceed their allotted portion and allocate more GPU memory than they should. Otherwise, applications can harm each other, even mistakenly.
This requirement is even harder to justify when the applications sharing a GPU run in different containers. In many environments, containers should not be aware of each other, and they certainly should not be able to communicate with one another.
When different users are supposed to share a GPU, they have to agree on how much memory each application gets, making this a logistical issue as well as a technical one.
In addition, many applications assume they run alone on the GPU and allocate the entire GPU memory upfront by default. This is a common strategy when using an external processing unit (i.e. in addition to the CPU).
Therefore, code modifications are required to change this default behavior.
This might sound easy but not every user can do so, as it might require deep knowledge of the internals of the application and how to configure it.
Sometimes it might not even be possible. For example, when receiving a Docker image without controlling what application is running inside and without access to its configuration.
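When it is possible, the change is usually a few lines of framework configuration. The sketch below shows two common knobs (the limits are arbitrary examples, not recommendations): TensorFlow, which by default reserves almost all GPU memory up front, can be capped to a fixed-size logical device, and PyTorch can be limited to a fraction of the device per process.

```python
# Sketch: telling frameworks not to take more than their allotted share of GPU memory.

# TensorFlow reserves (almost) all GPU memory up front by default.
# Cap this process to a 4 GiB logical GPU instead (4096 MiB is an arbitrary example).
# This must run before the GPU is first used by TensorFlow.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)],
    )

# PyTorch allocates lazily by default, but its caching allocator can still be capped,
# e.g. to a third of the device for a three-way split.
import torch

if torch.cuda.is_available():
    torch.cuda.set_per_process_memory_fraction(1.0 / 3, device=0)
```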

Sharing a GPU Dynamically is Hard

Deciding how much GPU memory is allotted to every application might be relatively easy to do for a single user as there is only a single person who needs to decide.
A small team might also manage to do so, but only in a very inefficient way.
The team can decide on a strict policy in which each member gets an equal share of the GPU. For example, in a team of three members, each one would be given one-third of the GPU memory.
This might sound satisfying but the GPU would probably be underutilized a lot.
Any time one of the team members does not use his or her share of the GPU memory, it just sits unused.
This unused GPU memory could have been allocated by another team member, allowing him or her to use more memory (e.g. running larger models) or to run more applications (e.g. more models).
Additionally, the team members could never use more than their share without breaking their agreement and risking other members' applications failing with out-of-memory (OOM) errors.

Sharing a GPU In a Fair Way Is Hard

Teams and organizations want to share GPUs equally between users and projects and not between applications.
Users should not be granted more GPU memory for running more applications than their colleagues.
The problem is that the GPU knows nothing about users or projects; it only knows about applications (processes, to be more precise). Therefore, all applications get a roughly equal share of compute time, regardless of who runs them.
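Any notion of per-user fairness therefore has to be built on top of what the GPU exposes. As a hedged sketch of the bookkeeping involved, the snippet below lists the compute processes on a device via NVML and looks up the user that owns each PID (assuming the pynvml and psutil packages are available):

```python
# Sketch: attribute GPU memory usage to users, since the GPU itself only sees processes.
import psutil
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    try:
        user = psutil.Process(proc.pid).username()
    except psutil.NoSuchProcess:
        user = "<exited>"
    used_gib = (proc.usedGpuMemory or 0) / 2**30  # may be unavailable on some systems
    print(f"pid={proc.pid}  user={user}  gpu_mem={used_gib:.1f} GiB")

pynvml.nvmlShutdown()
```

Enforcing anything based on this information (quotas, preemption, scheduling) is exactly the part the GPU does not do for you.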

Sharing Multiple GPUs Is Much Harder

Until now we only talked about a single GPU. Things are much harder when speaking about multiple GPUs.
Let’s consider a cluster of a few machines, each having a few GPUs.
Think how much bigger the challenge of sharing all these GPUs among many users becomes, especially when you take all of the above into account.
Now there are many GPUs that need to be allocated for user workloads, and there are probably more users involved. The logistical challenge of managing these allocations alone is now much bigger.
Strict policies, like the ones discussed above, could be used, but they would now underutilize many more GPUs.
Also, remember that all the users should modify their workloads to be cooperative and not exceed their GPU memory limit. 
Think of how error-prone this process is with just a few GPUs.
Now think how much bigger the challenge is with clusters of hundreds of GPUs and users.
And there’s more.
Even if all the workloads are cooperative, and GPUs are allocated dynamically to all the users, suboptimal decisions would have a high impact on the overall cluster utilization.
For example, suboptimal allocation algorithms can cause fragmentation of free GPU memory: an application might not find any single GPU with enough free memory for it, even though the total free GPU memory across all GPUs is more than enough.
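A toy illustration of the fragmentation problem (the numbers are made up): two GPUs with 8GB free each cannot serve a single 12GB request, even though 16GB is free in total.

```python
# Toy illustration of fragmentation: enough free memory in total, but no single GPU fits.
free_memory_per_gpu_gib = [8, 8]   # two GPUs, each with 8 GiB free (out of 16 GiB)
request_gib = 12                   # an application that needs 12 GiB on one GPU

total_free = sum(free_memory_per_gpu_gib)
fits_somewhere = any(free >= request_gib for free in free_memory_per_gpu_gib)

print(f"total free: {total_free} GiB, request: {request_gib} GiB, "
      f"schedulable: {fits_somewhere}")   # total free: 16 GiB, ..., schedulable: False
```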
Another example is multiple applications sharing a GPU while other GPUs in the cluster sit completely free; these applications would suffer performance degradation for no reason.
We went through quite a lot here so let’s try to summarize the key takeaways.
GPUs are idle quite a lot, and they also have plenty of free memory left at most times.
Running more than a single application on the same GPU can utilize this free GPU memory, minimize these idle times and increase the GPU utilization drastically.
This is a big challenge due to the way GPU applications are usually built, and the way users and organizations manage their GPUs.
Also published on: https://medium.com/@raz.rotenberg/running-multiple-applications-on-the-same-gpu-fa74f4c4635d

Written by razrotenberg | Programmer. I like technology, music, and too many more things.