How GPUs are Beginning to Displace Clusters for Big Data & Data Science

Written by dan-voyce | Published 2020/01/05
Tech Story Tags: gpu | datascience | apache-spark | data-science | rapidsai | big-data | data-analytics | hackernoon-top-story

TLDR: Many big-data-driven companies are turning to GPUs in place of a traditional cluster, and I think we will see a leap to a more 'GPU First' mindset over the coming years. I have been using a low-grade consumer GPU (an NVIDIA GeForce 1060) to accomplish things that were previously only realistically possible on a cluster, and here is why I think this is the direction data science will go in the next 5 years. The GTX 1080 Ti illustrated below, for example, has 3,584 CUDA cores that can process data in parallel.

More recently on my data science journey I have been using a low-grade consumer GPU (an NVIDIA GeForce 1060) to accomplish things that were previously only realistically possible on a cluster. Here is why I think this is the direction data science will go in the next 5 years.

Clusters Clusters Clusters!

Now let me preface this article by saying that I don't think GPUs will replace clusters for ALL HPC use cases; however, I do think we will see a leap to a more 'GPU First' mindset over the coming years.
A standard workflow for many big-data-based companies is to develop some kind of pipeline that takes in some data, mangles it in some way (by combining it with other data or running some statistical analysis on it), and then outputs the results to some kind of BI dashboard to produce insights.
To do this, someone with experience would need to spin up and configure a cluster (Spark, MapReduce or similar) and provide the interface for departments to execute this on. The cluster could be anywhere from 3 nodes up to 1,000+ nodes depending on the workflow.
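To make that concrete, here is a minimal PySpark sketch of the kind of pipeline described above. The bucket paths, table names and columns are hypothetical, and a real pipeline would add schema handling and error checking:

from pyspark.sql import SparkSession, functions as F

# Spin up (or connect to) a Spark session on the cluster
spark = SparkSession.builder.appName("example-pipeline").getOrCreate()

# Ingest the raw data and a reference dataset to combine it with
visits = spark.read.csv("s3://example-bucket/raw/visits/*.csv.gz", header=True, inferSchema=True)
stores = spark.read.parquet("s3://example-bucket/reference/stores/")

# "Mangle" the data: join, then compute a simple daily aggregate
daily = (
    visits.join(stores, on="store_id", how="left")
          .groupBy("store_id", "visit_date")
          .agg(F.count("*").alias("visit_count"))
)

# Write the result somewhere a BI dashboard can read it
daily.write.mode("overwrite").parquet("s3://example-bucket/bi/daily_visits/")

Simple enough on paper, but every one of those steps runs on a cluster that someone has to provision, size and pay for.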
Even at conservative estimates this gets expensive quickly. Although cloud platforms like AWS and GCP make this easier than ever before, there is still a learning curve to getting it right, and these platforms can cost serious money, especially when things don't go right the first time (99%-complete data skew, anyone?).

GPUs in place of a traditional Cluster

The reason above is why I think many are turning to GPUs. As developers (especially freelancers) we often don't have the spare thousands of dollars to run a proof of concept for a pipeline like this at real scale. Everyone has access to a GPU in some form or other, and after all, what is a GPU really other than a self-contained cluster?
NVIDIA GPUs contain chips that have what are called "CUDA cores"; each of these cores is a miniature processor that can execute some code.
A popular consumer GPU, the GTX 1080 Ti, is illustrated below; it shows that this card has 3,584 CUDA cores that can process data in parallel. If that doesn't look like a multi-floor data center to you then I don't know what to say.
Image courtesy of extremetech
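As a quick illustration of what all those cores buy you, here is a tiny Python sketch using the CuPy library (my choice of library here, not something from the benchmark below; any CUDA-capable NVIDIA card with CuPy installed will do). The single array expression is fanned out across the card's CUDA cores rather than looping on the CPU:

import cupy as cp

# Allocate 100 million values directly in GPU memory
x = cp.random.random(100_000_000).astype(cp.float32)

# One vectorised expression; the work is split across the CUDA cores in parallel
y = cp.sqrt(x) * 2.0 + 1.0

# Pull a single summary number back to the host
print(float(y.mean()))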
You, reading this article right now, have some form of GPU in the machine you are using; if you are the kind of person who requires any kind of graphics performance, there might even be an NVIDIA card in there.
Note: although I am referring to NVIDIA in this article, other GPUs are also capable of performing the same tasks; unfortunately, however, the tooling isn't as mature as what NVIDIA provides with the CUDA toolkit.

Comparing performance of a Cluster with a GPU

Bear with me here, as this isn't as simple as comparing two systems; it is to a degree apples and oranges, so let's focus on the outcomes.
One task that your average data engineer often needs to complete is converting row-based data to a columnar format (such as ORC or Parquet). It is also one of those tasks that can be performed on anything from a single node right up to a 1,000-node cluster, with a fairly logarithmic increase in speed as you add more nodes. We do this because columnar formats have many benefits over row-based formats (that is for another article!).
Image courtesy of Cloudera
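For anyone who hasn't done this before, on a single machine the conversion itself is only a couple of lines. A minimal sketch using pandas with the PyArrow engine (the file names are hypothetical, and at real scale you would do this in chunks, on a cluster, or on a GPU as below):

import pandas as pd

# Read the row-oriented input
df = pd.read_csv("taxi_rides.csv")

# Write it back out as a columnar Parquet file (uses the pyarrow engine if installed)
df.to_parquet("taxi_rides.parquet")

The hard part isn't the API; it is doing this for hundreds of gigabytes in a reasonable amount of time.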

Cluster speed comparisons for converting CSV to Columnar

A big data consultant whom I have followed for some time helpfully produced benchmarks of several cluster systems performing this very task (https://tech.marksblogg.com/faster-csv-to-orc-conversions.html).
Mark compared what I consider to be the current generation of tools, Hive, Presto and Spark, using a 21-node cluster on AWS EMR.
He used a 100GB New York taxi rides dataset as the basis for his benchmark. The fastest conversion result was that of Presto (no surprise; I love Presto!), which came in at 37 minutes.
According to the AWS cost calculator, this cluster costs $430 USD per month if used for a maximum of 2 hours per day (Mark's longest conversion).

GPU speed comparisons for converting CSV to Columnar

I didn't have access to the same dataset Mark had, so I used a similarly sized one of our own.
The dataset comes in at just over 2 billion rows and has 41 fields. The total size of this data is 397GB uncompressed, or around 127GB gzip-compressed. This is about 25% larger than the dataset used for the cluster tests.
Using the very excellent RAPIDS.ai framework (https://rapids.ai, supported by NVIDIA), I imported the RAPIDS.ai Docker container on my QNAP NAS (32GB RAM, i7-6700, 14TB SATA, 2TB NVMe, GeForce GTX 1060 6GB).
Then, using the provided Jupyter notebook and my datasets, I created a basic script to handle the conversion:
%%time
# Lazily read all gzipped CSV files into a GPU-backed Dask dataframe
import dask_cudf as dc

ddf = dc.read_csv('/data/Data Files/Vegas/datafiles/csv/*.csv.gz', compression='gzip')

CPU times: user 1.82 s, sys: 870 ms, total: 2.69 s
Wall time: 6.99 s

%%time
# Repartition the dataframe into 3,000 smaller chunks before writing
ddf = ddf.repartition(npartitions=3000)

CPU times: user 60.2 ms, sys: 159 µs, total: 60.4 ms
Wall time: 57.6 ms

%%time
# Write the partitions out as ORC files (this triggers the actual computation)
ddf.to_orc('/data/Data Files/Vegas/datafiles/orc/')

CPU times: user 1h 4min 4s, sys: 30min 19s, total: 1h 34min 23s
Wall time: 41min 57s
This produced a total time of ~42 minutes. Not too shabby, but I felt we could do better by utilising the NVMe drives I have in my NAS.
I ran the same test, except this time I read from and wrote back to the NVMe drives, which came out at ~31 minutes. That is a bit disappointing if I am honest, considering the vast difference in read and write speeds; because I was reading and writing to a single drive, it probably had some I/O blocking going on. Once I can get a second NVMe drive into the NAS I will try this again.

How did a single $200 GPU beat a massive 21-node cluster at this task?

Well, as I said at the start, the test isn't massively fair, and there are deficiencies that the current software is still a way off addressing. For example, the repartition stage isn't ideal, as it produces 3,000 output files, whereas columnar storage formats do better with a smaller number of larger files (there is active work happening on RAPIDS.ai to solve this, however!).
But mostly I would say it is because the computation is well suited to a GPU in this case: with the "nodes" located so close together, and without networking or other inefficiencies introduced, a lot of the overheads are taken out. I am sure there are cases that more advanced data science folks come across where a GPU isn't as suitable (comment below, as I am interested in what they are!).
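On the repartitioning point above: in principle, if the card had enough memory to hold larger partitions, you could simply repartition down before writing so the ORC output lands in fewer, larger files. A hedged sketch (the partition count is illustrative and not something I have tested on my 6GB card):

# Fewer partitions means fewer, larger ORC files, but each partition must fit in GPU memory
ddf = ddf.repartition(npartitions=64)
ddf.to_orc('/data/Data Files/Vegas/datafiles/orc/')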

Conclusion

As a developer reading this (and the one writing it), I know I would be much more likely to try out a proof of concept if I knew I could do it on my local computer with no additional cost.
Analytics India Magazine recently reported a large uptick in the number of developers using GPU processing in their jobs:
Another significant change seen this year is the increase in the use of GPUs at work. While most of the data scientists still use PCs and similar models, the second-favourite product is Nvidia GeForce GTX 9 Series GPU. The number of people using it has grown from a mere 8% last year, to 28% in 2019
https://analyticsindiamag.com/data-science-skills-study-2019-by-aim-imarticus-learning/
I think that as data science tooling for GPUs gets better and GPU prices come down, even older-model GPUs such as the one I am using can be used to demonstrate, and get executive buy-in for, a GPU-based strategy going forward.
If my little QNAP can produce results like this on a 4-year-old GPU, imagine what the latest Tesla Turing and P100 models can produce for a few bucks an hour on any cloud provider. Then imagine putting multiple GPUs into an instance to accomplish things even faster.
We truly are entering the age of GPU data processing for the masses.

Written by dan-voyce | Director of Technology Solutions, Tech Nerd and lover of Tea!