Is a GPU Really Necessary for Data Science Work?

A big question for developers of Machine Learning and Deep Learning applications is whether or not to use a computer with a GPU. After all, GPUs are still very expensive. To give you an idea, a typical GPU for AI processing in Brazil costs between US$ 1,000.00 and US$ 7,000.00 (or more).

The purpose of this tutorial is to demonstrate the need to use a GPU for Deep Learning processing, and I'll show you that you can use Java without C++ for this!

If you don't want to invest in a CUDA GPU, Amazon offers instances suitable for GPU processing. Look at the price comparison of these three configurations:

Instance	vCPUs	GPUs	RAM	Hourly price (US$)
c5.2xlarge	8	0	16 GiB	0.34
p3.2xlarge	8	1	61 GiB	3.06
p3.8xlarge	32	4	244 GiB	12.24
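
One way to read this table is to ask how much faster the GPU instance has to be before it pays for itself. The sketch below is a back-of-the-envelope calculation, not a benchmark. The class and method names are made up for illustration, the job durations are hypothetical, and it assumes the cost is simply the hourly price times the running time.

```java
// Back-of-the-envelope cost comparison using the hourly prices from the table.
// InstanceCost and its methods are illustrative names, not part of any AWS SDK.
class InstanceCost {

    // Cost of a job, assuming billing is simply hourly price x hours used.
    static double jobCost(double hourlyPrice, double hours) {
        return hourlyPrice * hours;
    }

    // Speedup the more expensive instance needs in order to cost the same per job.
    static double breakEvenSpeedup(double cheapHourly, double expensiveHourly) {
        return expensiveHourly / cheapHourly;
    }

    public static void main(String[] args) {
        double c5 = 0.34; // c5.2xlarge, no GPU
        double p3 = 3.06; // p3.2xlarge, 1 GPU
        System.out.printf("Break-even speedup: %.1fx%n", breakEvenSpeedup(c5, p3));
        // Hypothetical job: 10 hours on the CPU instance, 30 minutes on the GPU one.
        System.out.printf("CPU instance: US$ %.2f%n", jobCost(c5, 10.0));
        System.out.printf("GPU instance: US$ %.2f%n", jobCost(p3, 0.5));
    }
}
```

With these prices, the p3.2xlarge breaks even when it finishes the same job about 9 times faster than the c5.2xlarge. Anything beyond that makes the GPU instance the cheaper option per job.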

Anyone who has ever trained a Machine or Deep Learning model knows that using a GPU can decrease the training time from days / hours to minutes / seconds, right?

But is it really necessary? Can't we use a cluster of cheap machines, as we do with Big Data?

The simplest and most direct answer is: YES, GPUs are needed to train models and nothing will replace them. However, you have to program properly in order to get the best out of using GPU, and not all libraries and frameworks do this efficiently.

How the GPU works

Let's start with an analogy, adapted from one I saw in a data science training course, which I really liked.

Imagine a huge motorcycle, like ... 1000 cc ... I don't know, a Kawasaki. It's a very fast bike, right? Now, imagine that you have 8 of these bikes and you want to deliver pizza. Each motorcycle can take 1 order to the customer, so if there are more than 8 orders, someone will have to wait for one of the bikes to be available for delivery.

This is how the CPU works: Very fast and focused on sequential processing. Each core is a very fast bike. Of course, you can adapt it so that each motorcycle delivers more than one pizza at a time, but, in any case, it will be sequential processing: Deliver one pizza, deliver the next, etc.

Now, let's say you have 2000 bicycles and 2000 delivery people. Although the motorcycles are much faster, you have far more bicycles and can deliver many orders at once, avoiding queues. The slowness of the bicycles is compensated by the parallelism.

The GPU is oriented toward parallel processing!

If we compare the processing time of a single task, the CPU wins, but if we consider parallelism and overall throughput, the GPU is unbeatable. That is why it is used for compute-intensive tasks such as virtual currency mining and Deep Learning.
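
The analogy can be turned into a toy calculation. This is only a sketch: the class name, the delivery times and the number of orders are all made up for illustration and are not hardware measurements. It assumes orders are handled in rounds, one order per worker per round.

```java
// Toy model of the pizza analogy: a few fast workers versus many slow ones.
// All numbers are made up for illustration; they are not hardware measurements.
class DeliveryModel {

    // Total time when orders are handled in rounds of `workers` at a time.
    static long totalMinutes(long orders, long workers, long minutesPerDelivery) {
        long rounds = (orders + workers - 1) / workers; // ceiling division
        return rounds * minutesPerDelivery;
    }

    public static void main(String[] args) {
        long orders = 2000;
        // 8 fast workers, assumed 10 minutes per delivery (the "CPU").
        long fast = totalMinutes(orders, 8, 10);
        // 2000 slow workers, assumed 30 minutes per delivery (the "GPU").
        long slow = totalMinutes(orders, 2000, 30);
        System.out.println("8 fast workers:    " + fast + " minutes");
        System.out.println("2000 slow workers: " + slow + " minutes");
    }
}
```

Under these assumptions, a single order is finished sooner by a fast worker (10 minutes against 30), but 2000 orders take 250 rounds with 8 workers and only one round with 2000 workers.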

How can we program for the GPU

Programming for the GPU is not simple. To start, you have to consider that there is more than one GPU vendor and that there are two well-known programming frameworks:

CUDA: Compute Unified Device Architecture, for Nvidia chips;
OpenCL: Used in GPUs from other vendors, such as AMD.

The CUDA programming interface is written in C, but there are bindings for Python, such as PyCUDA, and for Java, such as JCuda. However, they are a little harder to learn and to program with.

And you need to understand the CUDA platform well, as well as its individual components, such as cuDNN or cuBLAS.

However, there are easier and more interesting alternatives that use the GPU, such as Deeplearning4J and its associated project, ND4J. ND4J is like the NumPy of Java, only on steroids! It lets you use the available GPU(s) in a simple and practical way, and that is what we will use in this tutorial.

First of all

You must have an Nvidia GPU on your device. Find out which GPU you have, make sure the correct Nvidia driver is installed, and then install the CUDA Toolkit. If everything is correct, you can run the command below:

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce MX110       On   | 00000000:01:00.0 Off |                  N/A |
| N/A   50C    P0    N/A /  N/A |    666MiB /  2004MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1078      G   /usr/lib/xorg/Xorg                           302MiB |
|    0      1979      G   /usr/bin/gnome-shell                         125MiB |
|    0      2422      G   ...quest-channel-token=7684432457067004974   108MiB |
|    0     19488      G   ...-token=7864D1BD51E7DFBD5D19F40F0E37669D    47MiB |
|    0     20879      G   ...-token=8B052333281BD2F7FF0CBFF6F185BA98     1MiB |
|    0     24967      G   ...-token=62FCB4B2D2AE1DC66B4AF1A0693122BE    40MiB |
|    0     25379      G   ...equest-channel-token=587023958284958671    35MiB |
+-----------------------------------------------------------------------------+

AI jobs

What is an AI job, or rather a Deep Learning job? It is based on two complex mathematical operations:

Feedforward: Basically the linear combination of the weight matrices with the values in each layer, from the input to the output;
Backpropagation: Differential calculation of the gradient for each neuron (including the bias), from the last layer back to the first, in order to adjust the weights.

Feedforward is repeated for each record in the input set, multiplied by the number of iterations or epochs we want to train for, that is, many times. Backpropagation can be done at the same frequency, or at regular intervals, depending on the learning algorithm used.

In summary: vector calculations and derivatives computed over many values at the same time.

That is why GPUs are necessary for development, training and also for inference, depending on the complexity of the model.
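
To make the feedforward step concrete, here is a minimal sketch of a single dense layer in plain Java, with no GPU involved. The class name, the sigmoid activation and the sample weights are assumptions chosen for illustration. The point is that each output neuron is an independent sum of products, which is exactly the kind of work that can run in parallel.

```java
import java.util.Arrays;

// One dense layer of a feedforward pass: output = sigmoid(W * x + b).
// Plain Java, no GPU; the activation and the sample values are arbitrary choices.
class DenseLayer {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // weights is [outputs][inputs], bias is [outputs], input is [inputs].
    static double[] forward(double[][] weights, double[] bias, double[] input) {
        double[] out = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            double sum = bias[i];
            for (int j = 0; j < input.length; j++) {
                sum += weights[i][j] * input[j]; // each neuron is independent of the others
            }
            out[i] = sigmoid(sum);
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] w = {{0.5, -0.5}, {1.0, 1.0}};
        double[] b = {0.0, -1.0};
        double[] x = {1.0, 1.0};
        System.out.println(Arrays.toString(forward(w, b, x)));
    }
}
```

In a real network this computation is repeated for every layer and every input record, which is why it is normally expressed as matrix products rather than explicit loops.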

Demonstration

The project for this tutorial is a Java application that performs matrix multiplication, a common operation in deep learning jobs. It multiplies the matrices only once, first on the CPU, then on the GPU (using ND4J and the CUDA Toolkit). Note that it is not even a machine learning model, just a single basic operation.

The pom.xml file configures ND4J to use the GPU with the CUDA platform:

<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-cuda-10.1</artifactId>
    <version>1.0.0-beta4</version>
</dependency>

The main class, MatMul, is a simple application that defines two matrices and calculates their product, first on the CPU, then on the GPU, using ND4J.

I'm working with two matrices, of 500 x 300 and 300 x 400, nothing much for a typical neural network.
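
For reference, the CPU side of such a test can be sketched in plain Java as below. This is not the tutorial's MatMul class and it does not use ND4J. The class name, the random seed and the loop order are my own assumptions, and the timing you get will depend on your machine and on JVM warm-up.

```java
import java.util.Random;

// Naive CPU matrix product for the shapes mentioned above (500 x 300 times 300 x 400).
// Illustrative sketch only; it is not the tutorial's MatMul class and does not use ND4J.
class NaiveMatMul {

    // Returns a * b; a is [n][m], b is [m][p], the result is [n][p].
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length, m = b.length, p = b[0].length;
        if (a[0].length != m) {
            throw new IllegalArgumentException("Inner dimensions do not match");
        }
        double[][] c = new double[n][p];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < m; k++) {
                double aik = a[i][k];
                for (int j = 0; j < p; j++) {
                    c[i][j] += aik * b[k][j]; // one multiply-add
                }
            }
        }
        return c;
    }

    static double[][] randomMatrix(int rows, int cols, Random rnd) {
        double[][] x = new double[rows][cols];
        for (double[] row : x) {
            for (int j = 0; j < cols; j++) {
                row[j] = rnd.nextDouble();
            }
        }
        return x;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        double[][] a = randomMatrix(500, 300, rnd);
        double[][] b = randomMatrix(300, 400, rnd);
        long start = System.nanoTime();
        double[][] c = multiply(a, b);
        long elapsed = System.nanoTime() - start;
        System.out.println("Result shape: " + c.length + " x " + c[0].length);
        System.out.println("CPU time (nanoseconds): " + elapsed);
    }
}
```

For these shapes the inner statement runs 500 x 300 x 400 = 60,000,000 times, and every cell of the 500 x 400 result can be computed independently of the others.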

My laptop has an eighth-generation Intel Core i7 and an Nvidia MX110 GPU, which is very "entry level", with 256 CUDA cores and CUDA compute capability 5.0, that is, nothing much ... A Tesla K80 card, by comparison, has more than 3,500 CUDA cores (although its compute capability, 3.7, is actually lower).

Let's see the application execution:

CPU Interativo 	(nanossegundos): 111.203.589

...
1759 [main] INFO org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner  - Device Name: [GeForce MX110]; CC: [5.0]; Total/free memory: [2101870592]


GPU Paralelo 	(nanossegundos): 9.905.426


Percentual de tempo no cálculo da GPU com relação à CPU: 8.907469704057842

OK, the application output is still in Portuguese, so here is a quick translation:

"CPU Interativo (nanossegundos)": Iterative CPU (nanoseconds);
"GPU Paralelo (nanossegundos)": Parallel GPU (nanoseconds);
"Percentual de tempo no cálculo da GPU com relação à CPU": GPU calculation time as a percentage of CPU time: 8.9%.

Conclusion

Even on an entry-level GPU like mine, the matrix product took only 8.9% of the time it took on the CPU. A huge difference. Check it out:

CPU time: 111,203,589 nanoseconds;
GPU time: 9,905,426 nanoseconds.
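
The percentage and the equivalent speedup follow directly from these two timings. The small sketch below just redoes that arithmetic; the class and method names are illustrative and are not taken from the project.

```java
// Recomputes the ratio between the two timings reported above.
// Speedup and its methods are illustrative names, not part of the tutorial's project.
class Speedup {

    // GPU time expressed as a percentage of CPU time.
    static double percentOfCpuTime(long cpuNanos, long gpuNanos) {
        return 100.0 * gpuNanos / cpuNanos;
    }

    // How many times faster the GPU run was.
    static double speedup(long cpuNanos, long gpuNanos) {
        return (double) cpuNanos / gpuNanos;
    }

    public static void main(String[] args) {
        long cpu = 111_203_589L; // CPU time in nanoseconds
        long gpu = 9_905_426L;   // GPU time in nanoseconds
        System.out.println("GPU time as % of CPU time: " + percentOfCpuTime(cpu, gpu));
        System.out.println("Speedup: " + speedup(cpu, gpu));
    }
}
```

In other words, the GPU run was roughly 11 times faster than the CPU run for this single operation.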

Considering that the matrix product is only ONE operation, and that Feedforward performs this operation thousands of times, it is reasonable to believe that the difference would be much greater if we were really training a neural network.

And there is no point in clustering or RDMA, because nothing, NOTHING is able to match the performance of a single GPU.

Well, I hope I have demonstrated two things here: that a GPU is essential, and how we can use it directly from a Java application. If you want, you can even convert that MLP model we made to run on the GPU and impress your boss (or boyfriend / girlfriend).