GPUs & Kubernetes for Deep Learning — Part 1/3

Written by samnco | Published 2017/02/15
Tech Story Tags: kubernetes | nvidia | deep-learning | data-science | aws


A few weeks ago I shared a side project about building a DIY GPU cluster for k8s, to play with Kubernetes with a proper ROI vs. AWS g2 instances.

This was spectacularly interesting when AWS was lagging behind with old NVIDIA K20 cards (which are not supported anymore by the latest drivers). But with the addition of the P series (p2.xlarge, 8xlarge and 16xlarge), the new cards are K80s with 12GB of RAM each, outrageously more powerful than the previous ones.

Baidu just released a post on the Kubernetes blog about their PaddlePaddle setup, but they only focused on CPUs. I thought it would be interesting to look at a setup of Kubernetes on AWS with some GPU nodes added, then exercise a Deep Learning framework on it. The docs say it is possible…

This post is the first of a series of 3: setting up the GPU cluster (this blog), adding storage to a Kubernetes cluster (right afterwards), and finally running a Deep Learning training job on the cluster (working on it, coming up post MWC…).

The Plan

In this blog, we will:

  1. Deploy k8s on AWS in a development mode (no HA, colocating etcd, the control plane and PKI)
  2. Deploy 2x nodes with GPUs (p2.xlarge and p2.8xlarge instances)
  3. Deploy 3x nodes with CPU only (m4.2xlarge instances)
  4. Validate GPU availability

Requirements

For what follows, it is important that:

  • You understand Kubernetes 101
  • You have admin credentials for AWS
  • If you followed the other posts, you know we’ll be using the Canonical Distribution of Kubernetes, hence some knowledge about Ubuntu, Juju and the rest of Canonical’s ecosystem will help.

Foreplay

  • Make sure you have Juju installed.

On Ubuntu,

sudo apt-add-repository ppa:juju/stable
sudo apt update
sudo apt install -yqq juju

For other OSes, look up the official docs.

Then to connect to the AWS cloud with your credentials, read this page

  • Finally, clone this repo to have access to all the sources

git clone https://github.com/madeden/blogposts
cd blogposts/k8s-gpu-cloud

OK! Let’s start GPU-izing the world!

Deploying the cluster

Bootstrap

As usual, start with the bootstrap sequence. Just be aware that p2 instances are only available in us-west-2, us-east-1 and eu-west-2, as well as the us-gov regions. I have experienced issues running p2 instances on the EU side, hence I recommend using a US region.

juju bootstrap aws/us-east-1 --credential canonical --constraints "cores=4 mem=16G root-disk=64G"
# Creating Juju controller "aws-us-east-1" on aws/us-east-1
# Looking for packaged Juju agent version 2.1-rc1 for amd64
# Launching controller instance(s) on aws/us-east-1...
#  - i-0d48b2c872d579818 (arch=amd64 mem=16G cores=4)
# Fetching Juju GUI 2.3.0
# Waiting for address
# Attempting to connect to 54.174.129.155:22
# Attempting to connect to 172.31.15.3:22
# Logging to /var/log/cloud-init-output.log on the bootstrap machine
# Running apt-get update
# Running apt-get upgrade
# Installing curl, cpu-checker, bridge-utils, cloud-utils, tmux
# Fetching Juju agent version 2.1-rc1 for amd64
# Installing Juju machine agent
# Starting Juju machine agent (service jujud-machine-0)
# Bootstrap agent now started
# Contacting Juju controller at 172.31.15.3 to verify accessibility...
# Bootstrap complete, "aws-us-east-1" controller now available.
# Controller machines are in the "controller" model.
# Initial model "default" added.

Deploying instances

Once the controller is ready, we can start deploying services. In my previous posts, I used bundles, which are shortcuts to deploy complex apps.

If you are already familiar with Juju, you can run juju deploy src/k8s-gpu.yaml and jump to the end of this section. For those interested in the details, this time we will deploy manually and go through the logic of the deployment.
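For the curious, a bundle is just a YAML description of the applications, constraints and relations to deploy. Below is a hypothetical, heavily trimmed sketch of what such a file looks like; the real src/k8s-gpu.yaml in the companion repo is the authoritative version, and the names and constraints here simply mirror the manual commands used in this post.

```shell
# Write a minimal, illustrative bundle sketch (NOT the real src/k8s-gpu.yaml).
cat > /tmp/k8s-gpu-sketch.yaml <<'EOF'
series: xenial
services:
  kubernetes-master:
    charm: cs:~containers/kubernetes-master-11
    num_units: 1
    constraints: "cores=4 mem=8G root-disk=32G"
  kubernetes-worker-gpu:
    charm: cs:~containers/kubernetes-worker-13
    num_units: 1
    constraints: "instance-type=p2.xlarge"
relations:
  - ["kubernetes-worker-gpu:kube-api-endpoint", "kubernetes-master:kube-api-endpoint"]
EOF

# A bundle then deploys with a single command:
# juju deploy /tmp/k8s-gpu-sketch.yaml
```

The whole topology (applications, placement, relations) lives in one file, which is why a bundle can replace the long sequence of deploy and add-relation commands that follows.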

Kubernetes is made of 5 individual applications: Master, Worker, Flannel (network), etcd (cluster state storage DB) and easyRSA (PKI to encrypt communication and provide x509 certs). In Juju, each app is modeled by a charm, which is a recipe of how to deploy it.

At deployment time, you can give constraints to Juju, either very specific (an instance type) or lax (a number of cores). With the latter, Juju will pick the cheapest instance matching your constraints on the target cloud.

First thing is to deploy the applications:

juju deploy cs:~containers/kubernetes-master-11 --constraints "cores=4 mem=8G root-disk=32G"
# Located charm "cs:~containers/kubernetes-master-11".
# Deploying charm "cs:~containers/kubernetes-master-11".
juju deploy cs:~containers/etcd-23 --to 0
# Located charm "cs:~containers/etcd-23".
# Deploying charm "cs:~containers/etcd-23".
juju deploy cs:~containers/easyrsa-6 --to lxd:0
# Located charm "cs:~containers/easyrsa-6".
# Deploying charm "cs:~containers/easyrsa-6".
juju deploy cs:~containers/flannel-10
# Located charm "cs:~containers/flannel-10".
# Deploying charm "cs:~containers/flannel-10".
juju deploy cs:~containers/kubernetes-worker-13 --constraints "instance-type=p2.xlarge" kubernetes-worker-gpu
# Located charm "cs:~containers/kubernetes-worker-13".
# Deploying charm "cs:~containers/kubernetes-worker-13".
juju deploy cs:~containers/kubernetes-worker-13 --constraints "instance-type=p2.8xlarge" kubernetes-worker-gpu8
# Located charm "cs:~containers/kubernetes-worker-13".
# Deploying charm "cs:~containers/kubernetes-worker-13".
juju deploy cs:~containers/kubernetes-worker-13 --constraints "instance-type=m4.2xlarge" -n3 kubernetes-worker-cpu
# Located charm "cs:~containers/kubernetes-worker-13".
# Deploying charm "cs:~containers/kubernetes-worker-13".

Here you can see an interesting property of Juju that we have not used before: naming the services you deploy. We deployed the same kubernetes-worker charm three times: twice with GPUs and once without. This gives us a way to group instances of a certain type, at the cost of duplicating some commands.

Also note the revision numbers in the charms we deploy. Revisions are not directly tied to versions of the software they deploy. If you omit them, Juju will pick the latest revision, much like Docker does with image tags.

Adding the relations & Exposing software

Now that the applications are deployed, we need to tell Juju how they are related together. For example, the Kubernetes master needs certificates to secure its API. Therefore, there is a relation between the kubernetes-master:certificates and easyrsa:client.

This relation means that once the two applications are connected, some scripts will run to query the EasyRSA API to create the required certificates, then copy them to the right location on the k8s master.

These relations then create statuses in the cluster, to which charms can react.

Essentially, at a very high level, think of Juju as a pub-sub implementation of application deployment. Every action inside or outside of the cluster posts a message to a common bus, and charms can react to these messages and perform additional actions, modifying the overall state… and so on until equilibrium is reached.

Let’s add the relations:

juju add-relation kubernetes-master:certificates easyrsa:client
juju add-relation etcd:certificates easyrsa:client
juju add-relation kubernetes-master:etcd etcd:db
juju add-relation flannel:etcd etcd:db
juju add-relation flannel:cni kubernetes-master:cni

for TYPE in cpu gpu gpu8
do
  juju add-relation kubernetes-worker-${TYPE}:kube-api-endpoint kubernetes-master:kube-api-endpoint
  juju add-relation kubernetes-master:cluster-dns kubernetes-worker-${TYPE}:kube-dns
  juju add-relation kubernetes-worker-${TYPE}:certificates easyrsa:client
  juju add-relation flannel:cni kubernetes-worker-${TYPE}:cni
  juju expose kubernetes-worker-${TYPE}
done

juju expose kubernetes-master

Note the expose commands at the end. These are instructions for Juju to open up the firewall in the cloud for specific ports on the instances. Some ports are predefined in charms (the Kubernetes Master API is 6443; workers open up 80 and 443 for ingresses), but you can also force ports open if you need to (for example, when you manually add services on the instances post deployment).

Adding CUDA

CUDA does not have an official charm yet (coming up very soon!!), but you can find my demoware implementation on GitHub. It has been updated for this post to CUDA 8.0.61 and driver 375.26.

Make sure you have the charm tools available, clone and build the CUDA charm:

sudo apt install charm charm-tools

# Exporting the ENV
mkdir -p ~/charms ~/charms/layers ~/charms/interfaces
export JUJU_REPOSITORY=${HOME}/charms
export LAYER_PATH=${JUJU_REPOSITORY}/layers
export INTERFACE_PATH=${JUJU_REPOSITORY}/interfaces

# Build the charm
cd ${LAYER_PATH}
git clone https://github.com/SaMnCo/layer-nvidia-cuda cuda
charm build cuda

This will create a new folder called builds in JUJU_REPOSITORY, with a cuda folder inside it.

Now you can deploy the charm

juju deploy --series xenial $HOME/charms/builds/cuda
juju add-relation cuda kubernetes-worker-gpu
juju add-relation cuda kubernetes-worker-gpu8

This will take a fair amount of time, as CUDA is slow to install (CDK takes about 10 minutes to deploy; CUDA alone takes probably another 15).

Nevertheless, at the end the status should show:

juju status
Model    Controller     Cloud/Region   Version
default  aws-us-east-1  aws/us-east-1  2.1-rc1

App                     Version  Status  Scale  Charm              Store       Rev  OS      Notes
cuda                             active      2  cuda               local         2  ubuntu
easyrsa                 3.0.1    active      1  easyrsa            jujucharms    6  ubuntu
etcd                    2.2.5    active      1  etcd               jujucharms   23  ubuntu
flannel                 0.7.0    active      6  flannel            jujucharms   10  ubuntu
kubernetes-master       1.5.2    active      1  kubernetes-master  jujucharms   11  ubuntu  exposed
kubernetes-worker-cpu   1.5.2    active      3  kubernetes-worker  jujucharms   13  ubuntu  exposed
kubernetes-worker-gpu   1.5.2    active      1  kubernetes-worker  jujucharms   13  ubuntu  exposed
kubernetes-worker-gpu8  1.5.2    active      1  kubernetes-worker  jujucharms   13  ubuntu  exposed

Unit                       Workload  Agent  Machine  Public address  Ports           Message
easyrsa/0*                 active    idle   0/lxd/0  10.0.0.122                      Certificate Authority connected.
etcd/0*                    active    idle   0        54.242.44.224   2379/tcp        Healthy with 1 known peers.
kubernetes-master/0*       active    idle   0        54.242.44.224   6443/tcp        Kubernetes master running.
  flannel/0*               active    idle            54.242.44.224                   Flannel subnet 10.1.76.1/24
kubernetes-worker-cpu/0    active    idle   4        52.86.161.22    80/tcp,443/tcp  Kubernetes worker running.
  flannel/4                active    idle            52.86.161.22                    Flannel subnet 10.1.79.1/24
kubernetes-worker-cpu/1*   active    idle   5        52.70.5.49      80/tcp,443/tcp  Kubernetes worker running.
  flannel/2                active    idle            52.70.5.49                      Flannel subnet 10.1.63.1/24
kubernetes-worker-cpu/2    active    idle   6        174.129.164.95  80/tcp,443/tcp  Kubernetes worker running.
  flannel/3                active    idle            174.129.164.95                  Flannel subnet 10.1.22.1/24
kubernetes-worker-gpu8/0*  active    idle   3        52.90.163.167   80/tcp,443/tcp  Kubernetes worker running.
  cuda/1                   active    idle            52.90.163.167                   CUDA installed and available
  flannel/5                active    idle            52.90.163.167                   Flannel subnet 10.1.35.1/24
kubernetes-worker-gpu/0*   active    idle   1        52.90.29.98     80/tcp,443/tcp  Kubernetes worker running.
  cuda/0*                  active    idle            52.90.29.98                     CUDA installed and available
  flannel/1                active    idle            52.90.29.98                     Flannel subnet 10.1.58.1/24

Machine  State    DNS             Inst id              Series  AZ
0        started  54.242.44.224   i-09ea4f951f651687f  xenial  us-east-1a
0/lxd/0  started  10.0.0.122      juju-65a910-0-lxd-0  xenial
1        started  52.90.29.98     i-03c3e35c2e8595491  xenial  us-east-1c
3        started  52.90.163.167   i-0ca0716985645d3f2  xenial  us-east-1d
4        started  52.86.161.22    i-02de3aa8efcd52366  xenial  us-east-1e
5        started  52.70.5.49      i-092ac5367e31188bb  xenial  us-east-1a
6        started  174.129.164.95  i-0a0718343068a5c94  xenial  us-east-1c

Relation      Provides                Consumes                Type
juju-info     cuda                    kubernetes-worker-gpu   regular
juju-info     cuda                    kubernetes-worker-gpu8  regular
certificates  easyrsa                 etcd                    regular
certificates  easyrsa                 kubernetes-master       regular
certificates  easyrsa                 kubernetes-worker-cpu   regular
certificates  easyrsa                 kubernetes-worker-gpu   regular
certificates  easyrsa                 kubernetes-worker-gpu8  regular
cluster       etcd                    etcd                    peer
etcd          etcd                    flannel                 regular
etcd          etcd                    kubernetes-master       regular
cni           flannel                 kubernetes-master       regular
cni           flannel                 kubernetes-worker-cpu   regular
cni           flannel                 kubernetes-worker-gpu   regular
cni           flannel                 kubernetes-worker-gpu8  regular
cni           kubernetes-master       flannel                 subordinate
kube-dns      kubernetes-master       kubernetes-worker-cpu   regular
kube-dns      kubernetes-master       kubernetes-worker-gpu   regular
kube-dns      kubernetes-master       kubernetes-worker-gpu8  regular
cni           kubernetes-worker-cpu   flannel                 subordinate
juju-info     kubernetes-worker-gpu   cuda                    subordinate
cni           kubernetes-worker-gpu   flannel                 subordinate
juju-info     kubernetes-worker-gpu8  cuda                    subordinate
cni           kubernetes-worker-gpu8  flannel                 subordinate

Let us see what nvidia-smi gives us:

juju ssh kubernetes-worker-gpu/0 sudo nvidia-smi
Tue Feb 14 13:28:42 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:00:1E.0     Off |                    0 |
| N/A   33C    P0    81W / 149W |      0MiB / 11439MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

On the more powerful 8xlarge,

juju ssh kubernetes-worker-gpu8/0 sudo nvidia-smi
Tue Feb 14 13:59:24 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:00:17.0     Off |                    0 |
| N/A   41C    P8    31W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:00:18.0     Off |                    0 |
| N/A   36C    P0    70W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 0000:00:19.0     Off |                    0 |
| N/A   44C    P0    57W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 0000:00:1A.0     Off |                    0 |
| N/A   38C    P0    70W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 0000:00:1B.0     Off |                    0 |
| N/A   43C    P0    57W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 0000:00:1C.0     Off |                    0 |
| N/A   38C    P0    69W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 0000:00:1D.0     Off |                    0 |
| N/A   44C    P0    58W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 0000:00:1E.0     Off |                    0 |
| N/A   38C    P0    71W / 149W |      0MiB / 11439MiB |     39%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Aaaand yes!! We have our 8 GPUs as expected, so 8x 12GB = 96GB of video RAM!

At this stage, we only have them enabled on the hosts. Now let us add GPU support in Kubernetes.

Adding GPU support in Kubernetes

By default, CDK will not activate GPUs when starting the API server and the Kubelets. We need to do that manually (for now).

Master Update

On the master node, update /etc/default/kube-apiserver to add:

# Security Context
KUBE_ALLOW_PRIV="--allow-privileged=true"

before restarting the API Server. This can be done programmatically with:

juju show-status kubernetes-master --format json | \
  jq --raw-output '.applications."kubernetes-master".units | keys[]' | \
  xargs -I UNIT juju ssh UNIT "echo -e '\n# Security Context\nKUBE_ALLOW_PRIV=\"--allow-privileged=true\"' | sudo tee -a /etc/default/kube-apiserver && sudo systemctl restart kube-apiserver.service"
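If the jq filter looks opaque: it simply lists the unit names of the application, which xargs then feeds to juju ssh one by one. Here is what it extracts when run against a mocked (hypothetical) show-status document instead of a live controller:

```shell
# A fake, minimal juju show-status JSON document for illustration only.
cat > /tmp/status.json <<'EOF'
{
  "applications": {
    "kubernetes-master": {
      "units": {
        "kubernetes-master/0": {"workload-status": "active"}
      }
    }
  }
}
EOF

# The same filter as in the pipeline above: list the keys of the units object,
# i.e. the unit names.
jq --raw-output '.applications."kubernetes-master".units | keys[]' /tmp/status.json
```

With several units deployed, this prints one unit name per line, so the xargs step runs the ssh command on every unit of the application.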

So now the Kube API will accept requests to run privileged containers, which are required for GPU workloads.

Worker nodes

On every worker, update /etc/default/kubelet to add the GPU flag, so it looks like:

# Security Context
KUBE_ALLOW_PRIV="--allow-privileged=true"

# Add your own!
KUBELET_ARGS="--experimental-nvidia-gpus=1 --require-kubeconfig --kubeconfig=/srv/kubernetes/config --cluster-dns=10.1.0.10 --cluster-domain=cluster.local"

before restarting the service.

This can be done with

for WORKER_TYPE in gpu gpu8
do
  juju show-status kubernetes-worker-${WORKER_TYPE} --format json | \
    jq --raw-output '.applications."kubernetes-worker-'${WORKER_TYPE}'".units | keys[]' | \
    xargs -I UNIT juju ssh UNIT "echo -e '\n# Security Context\nKUBE_ALLOW_PRIV=\"--allow-privileged=true\"' | sudo tee -a /etc/default/kubelet"

  juju show-status kubernetes-worker-${WORKER_TYPE} --format json | \
    jq --raw-output '.applications."kubernetes-worker-'${WORKER_TYPE}'".units | keys[]' | \
    xargs -I UNIT juju ssh UNIT "sudo sed -i 's/KUBELET_ARGS=\"/KUBELET_ARGS=\"--experimental-nvidia-gpus=1 /' /etc/default/kubelet && sudo systemctl restart kubelet.service"
done
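The sed substitution in that loop simply prepends the GPU flag to the existing KUBELET_ARGS. Demonstrated locally on a sample /etc/default/kubelet line (the file path and argument values here are illustrative):

```shell
# A sample kubelet defaults line, written to a temp file.
echo 'KUBELET_ARGS="--require-kubeconfig --kubeconfig=/srv/kubernetes/config"' > /tmp/kubelet

# Same substitution as in the loop: inject the flag right after the opening quote.
sed -i 's/KUBELET_ARGS="/KUBELET_ARGS="--experimental-nvidia-gpus=1 /' /tmp/kubelet

cat /tmp/kubelet
# prints: KUBELET_ARGS="--experimental-nvidia-gpus=1 --require-kubeconfig --kubeconfig=/srv/kubernetes/config"
```

Anchoring the match on the opening quote is what makes the edit idempotent to reorderings of the other flags: whatever KUBELET_ARGS already contains is preserved after the injected flag.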

Testing our setup

Now we want to check that the cluster actually has GPU support enabled. To validate this, run a job with an nvidia-smi pod:

kubectl create -f src/nvidia-smi.yaml

Then wait a little bit and run the log command:

kubectl logs $(kubectl get pods -l name=nvidia-smi -o=name -a)
Tue Feb 14 14:14:57 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80          Off   | 0000:00:17.0     Off |                    0 |
| N/A   47C    P0    56W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80          Off   | 0000:00:18.0     Off |                    0 |
| N/A   39C    P0    70W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80          Off   | 0000:00:19.0     Off |                    0 |
| N/A   48C    P0    57W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80          Off   | 0000:00:1A.0     Off |                    0 |
| N/A   41C    P0    70W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80          Off   | 0000:00:1B.0     Off |                    0 |
| N/A   47C    P0    58W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80          Off   | 0000:00:1C.0     Off |                    0 |
| N/A   40C    P0    69W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80          Off   | 0000:00:1D.0     Off |                    0 |
| N/A   48C    P0    59W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80          Off   | 0000:00:1E.0     Off |                    0 |
| N/A   41C    P0    72W / 149W |      0MiB / 11439MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

What is interesting here is that the pod sees all the cards, even though we only shared the /dev/nvidia0 char device. At runtime, we would have problems. If you want to run multi-GPU containers, you need to share all the char devices, as we do in the second YAML file (nvidia-smi-8.yaml).
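To make the idea concrete, here is a hypothetical sketch of what such a multi-GPU manifest can look like; the real file is nvidia-smi-8.yaml in the repo, and the image name and the alpha.kubernetes.io/nvidia-gpu resource key below are assumptions in line with Kubernetes 1.5-era GPU support, not copied from the repo:

```shell
# Sketch of a multi-GPU job spec: privileged container, NVIDIA char devices
# passed through as hostPath volumes (NOT the actual nvidia-smi-8.yaml).
cat > /tmp/nvidia-smi-multi.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: nvidia-smi
spec:
  template:
    metadata:
      labels:
        name: nvidia-smi
    spec:
      restartPolicy: Never
      containers:
      - name: nvidia-smi
        image: nvidia/cuda:8.0-runtime   # assumed image
        command: ["nvidia-smi"]
        securityContext:
          privileged: true
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 8
        volumeMounts:
        - name: nvidia0
          mountPath: /dev/nvidia0
        - name: nvidiactl
          mountPath: /dev/nvidiactl
        # ...repeat a mount for each of /dev/nvidia1 to /dev/nvidia7
      volumes:
      - name: nvidia0
        hostPath:
          path: /dev/nvidia0
      - name: nvidiactl
        hostPath:
          path: /dev/nvidiactl
      # ...repeat a volume for each of /dev/nvidia1 to /dev/nvidia7
EOF

# kubectl create -f /tmp/nvidia-smi-multi.yaml
```

The key points are the privileged security context (which is why we enabled --allow-privileged earlier) and one hostPath volume per char device, so the container can actually open every GPU it sees.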

Conclusion

We reached the first milestone of our 3 part journey: the cluster is up & running, GPUs are activated, and Kubernetes will now welcome GPU workloads.

If you are a data scientist, or running Kubernetes workloads that could benefit from GPUs, this already gives you an elegant and very fast way of managing your setups. But usually, in this context, you also need storage available between the instances, whether to share datasets or to exchange results.

Kubernetes offers many options to connect storage. In the second part of the blog, we will see how to automate adding EFS storage to our instances, then put it to good use with some datasets!

In the meantime, feel free to contact me if you have a specific use case in the cloud and want to discuss operational details. I would be happy to help you set up your own GPU cluster and get you started on the science!

Tearing Down

Whenever you feel like it, you can tear down this cluster. These instances can be pricey, so powering them down when you are not using them is not a bad idea.

juju kill-controller aws/us-east-1

This will ask for confirmation then destroy everything… But now, you are just a few commands and a coffee away from rebuilding it, so that is not a problem.



Published by HackerNoon on 2017/02/15