Speed up state-of-the-art ViT models in Hugging Face 🤗 up to 2300% (25x faster) with Databricks, Nvidia, and Spark NLP 🚀

I am one of the contributors to the Spark NLP open-source project and just recently this library started supporting end-to-end Vision Transformers (ViT) models. I use Spark NLP and other ML/DL open-source libraries for work daily and I have decided to deploy a ViT pipeline for a state-of-the-art image classification task and provide in-depth comparisons between Hugging Face and Spark NLP.

The purpose of this article is to demonstrate how to scale out Vision Transformer (ViT) models from Hugging Face and deploy them in production-ready environments for accelerated and high-performance inference. By the end, we will make inference with a ViT model from Hugging Face 25x faster (2300%) by using Databricks, Nvidia, and Spark NLP.

In this article I will:

Give a short introduction to Vision Transformer (ViT)

Benchmark Hugging Face inside Dell server on CPUs & GPUs

Benchmark Spark NLP inside Dell server on CPUs & GPUs

Benchmark Hugging Face inside Databricks Single Node on CPUs & GPUs

Benchmark Spark NLP inside Databricks Single Node on CPUs & GPUs

Benchmark Spark NLP inside Databricks scaled to 10x Nodes with CPUs & GPUs

Sum up everything!

In the spirit of full transparency, all the notebooks with their logs, screenshots, and even the Excel sheet with the numbers are provided here on GitHub.

Introduction to Vision Transformer (ViT) models

Back in 2017, a group of researchers at Google AI published a paper that introduced a transformer model architecture that changed all Natural Language Processing (NLP) standards. The paper describes a novel mechanism called self-attention as a new and more efficient model for language applications. For instance, two of the most popular families of transformer-based models are GPT and BERT.

A bit of Transformer history https://huggingface.co/course/chapter1/4

There is a great chapter called “How Transformers Work”, which I highly recommend reading if you are interested.

Although these new Transformer-based models seemed to be revolutionizing NLP tasks, their usage in Computer Vision (CV) remained pretty much limited. The field of Computer Vision has been dominated by convolutional neural networks (CNNs), and there are popular architectures based on CNNs (like ResNet). This had been the case until another team of researchers, this time at Google Brain, introduced the “Vision Transformer” (ViT) in October 2020 (published at ICLR 2021) in a paper titled: “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”

This paper represents a breakthrough when it comes to image recognition by using the same self-attention mechanism used in transformer-based models such as BERT and GPT, as we just discussed. In Transformer-based language models like BERT, the input is a sentence (for instance a list of words). In ViT models, however, we first split an image into a grid of sub-image patches, then embed each patch with a linear projection so that each embedded patch becomes a token. The result is a sequence of embedded patches which we pass to the model, similar to BERT.

An overview of the ViT model structure as introduced in Google Research’s original 2021 paper
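To make the patch idea above concrete, here is a minimal sketch in plain Python. It is not how any library implements ViT: the toy 4x4 "image", the 2x2 patch size, and the tiny projection matrix are all made-up values for illustration, and the real model also adds a class token and position embeddings, which are left out here.

```python
def split_into_patches(image, patch_size):
    """Split a 2D grid (list of rows) into flattened square patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


def project(patch, weights):
    """Linear projection: one output value per weight row (no bias, for brevity)."""
    return [sum(w * x for w, x in zip(row, patch)) for row in weights]


# Toy 4x4 single-channel "image" cut into 2x2 patches: 4 tokens of 4 values each.
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
toy_patches = split_into_patches(toy_image, 2)

# Hypothetical 2x4 projection matrix mapping each flattened patch to a 2-d embedding.
toy_weights = [[1, 0, 0, 0], [0.25, 0.25, 0.25, 0.25]]
toy_tokens = [project(p, toy_weights) for p in toy_patches]

# Same arithmetic for a 224x224 RGB image with 16x16 patches.
num_patches = (224 // 16) ** 2      # 196 patches, i.e. 196 tokens
values_per_patch = 16 * 16 * 3      # 768 raw values per patch
```

With 16x16 patches on a 224x224 image, the model therefore sees a sequence of 196 patch tokens, which is what makes the BERT analogy work.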

Vision Transformer focuses on higher accuracy with less compute time. Looking at the benchmarks published in the paper, we can see that the training time compared to Noisy Student (published by Google in June 2020) has decreased by 80% even though the accuracy is more or less the same. For more information regarding ViT performance today, you should visit its page on Papers With Code:

Comparison with state of the art on popular image classification benchmarks. (https://arxiv.org/pdf/2010.11929.pdf)

It is also important to mention that with the ViT architecture you can pre-train a model and then fine-tune it, just as you do in NLP (that’s pretty cool actually!).

If we compare ViT models to CNNs, we can see that they have higher accuracy with a much lower computational cost. You can use ViT models for a variety of downstream tasks in Computer Vision like image classification, object detection, and image segmentation. These tasks can also be domain-specific: in Healthcare, you can pre-train/fine-tune your ViT models for femur fractures, emphysema, breast cancer, COVID-19, and Alzheimer’s disease.¹

I will leave references at the end of this article just in case you want to dig deeper into how ViT models work.

[1]: Deep Dive: Vision Transformers On Hugging Face Optimum Graphcore https://huggingface.co/blog/vision-transformers

Some ViT models in action

Vision Transformer (ViT) model (vit-base-patch16-224) pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 224x224:

https://huggingface.co/google/vit-base-patch16-224

Fine-tuned ViT models used for food classification:

https://huggingface.co/nateraw/food — https://huggingface.co/julien-c/hotdog-not-hotdog

There are, however, limitations & restrictions to any DL/ML model when it comes to prediction. There is no model with 100% accuracy, so keep that in mind when you are using them for something important like Healthcare:

Image is taken from: https://www.akc.org/expert-advice/lifestyle/do-you-live-in-dog-state-or-cat-state/ — ViT model: https://huggingface.co/julien-c/hotdog-not-hotdog

Can we use these models from Hugging Face or fine-tune new ViT models and use them for inference in real production? How can we scale them by using managed services for distributed computations such as AWS EMR, Azure HDInsight, GCP Dataproc, or Databricks?

Hopefully, some of these will be answered by the end of this article.

Let the Benchmarks Begin!

Some details about our benchmarks:

1- Dataset: ImageNet mini: sample (>3K) — full (>34K)

I have downloaded ImageNet 1000 (mini) dataset from Kaggle: https://www.kaggle.com/datasets/ifigotin/imagenetmini-1000

I have chosen the train directory with over 34K images and called it imagenet-mini since all I needed was enough images to do benchmarks that take longer. In addition, I have randomly selected roughly 10% of the full dataset and called it imagenet-mini-sample, which has 3544 images, for my smaller benchmarks and also to tune parameters like the batch size.
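For readers who want to build a similar sample, here is a minimal sketch of a reproducible random selection. The file names, the seed, and the helper name `sample_paths` are assumptions for illustration; the notebooks may have selected the sample differently.

```python
import random


def sample_paths(paths, k, seed=42):
    """Return a reproducible random subset of `k` items from `paths`."""
    if not 0 <= k <= len(paths):
        raise ValueError("k must be between 0 and len(paths)")
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return sorted(rng.sample(paths, k))


# Hypothetical file names standing in for the real imagenet-mini train directory.
all_paths = [f"imagenet-mini/img_{i:05d}.JPEG" for i in range(34745)]
sample = sample_paths(all_paths, k=3544)
```

Fixing the seed means the same sample can be regenerated later, which matters when the small dataset is used to pick parameters for the large one.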

2- Model: The “vit-base-patch16-224” by Google

We will be using this model from Google hosted on Hugging Face: https://huggingface.co/google/vit-base-patch16-224

3- Libraries: Transformers 🤗 & Spark NLP 🚀

Benchmarking Hugging Face on a Bare Metal Server

ViT model on a Dell PowerEdge C4130

What is a bare-metal server? A bare-metal server is just a physical computer that is only being used by one user. There is no hypervisor installed on this machine, there are no virtualizations, and everything is being executed directly on the main OS (Linux — Ubuntu) — the detailed specs of CPUs, GPUs, and the memory of this machine are inside the notebooks.

As my initial tests, plus almost every blog post written by the Hugging Face engineering team comparing inference speed among DL engines, have revealed, the best inference performance in the Hugging Face library (Transformers) is achieved by using PyTorch over TensorFlow. I am not sure whether this is because TensorFlow is a second-class citizen in Hugging Face (fewer supported features, fewer supported models, fewer examples, outdated tutorials, and yearly surveys over the last 2 years in which users keep asking for more TensorFlow support) or because PyTorch simply has lower inference latency on both CPU and GPU.

TensorFlow remains the most-used deep learning framework

Regardless of the reason, I have chosen PyTorch in the Hugging Face library to get the best results for our image classification benchmarks. This is a simple code snippet to use a ViT model (PyTorch of course) in Hugging Face:

from transformers import ViTFeatureExtractor, ViTForImageClassification
from PIL import Image
import requests

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits
# model predicts one of the 1000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
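The argmax step in the snippet above only returns the single best class. As a small sketch of the same post-processing in plain Python, the helper below turns raw logits into probabilities and returns the top-k labels. The logits and the label mapping here are made-up values, not output from the model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=3):
    """Return the k most likely (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical logits and labels, for illustration only.
example_logits = [0.5, 2.0, -1.0, 1.0]
example_labels = {0: "tabby", 1: "Egyptian cat", 2: "remote control", 3: "tiger cat"}
predictions = top_k(example_logits, example_labels, k=2)
```

Looking at more than one class is often useful with ImageNet models, where several labels can be visually close.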

This is a straightforward way to predict a single input image, but it is not suitable for larger amounts of images, especially on a GPU. To avoid predicting images sequentially and to take advantage of accelerated hardware such as a GPU, it is best to feed the model with batches of images, which is possible in Hugging Face via Pipelines. Needless to say, you can implement your own batching technique either by extending Hugging Face’s Pipelines or by doing it on your own.

A simple pipeline for Image Classification will look like this:

from transformers import ViTFeatureExtractor, ViTForImageClassification
from transformers import pipeline
    
feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

pipe = pipeline("image-classification", model=model, feature_extractor=feature_extractor, device=-1)

As per the documentation, I have downloaded/loaded google/vit-base-patch16-224 for the feature extractor and model (PyTorch checkpoints of course) to use them in the pipeline with image classification as the task. There are 3 things in this pipeline that are important to our benchmarks:

> device: If it’s -1 (default) it will only use CPUs, while if it’s a non-negative integer it will run the model on the associated CUDA device id (it’s best to hide the GPUs and force PyTorch to use CPU and not just rely on this number here).

> batch_size: When the pipeline uses a DataLoader (when passing a dataset, on GPU for a PyTorch model), this is the size of the batch to use; for inference, batching is not always beneficial.

> You have to use either DataLoader or PyTorch Dataset to take full advantage of batching in Hugging Face pipelines on a GPU.
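One common way to hide the GPUs from PyTorch for a CPU-only run is the `CUDA_VISIBLE_DEVICES` environment variable. This is a minimal sketch and not necessarily how the notebooks did it; the important assumption is that it runs before `torch` (or any other CUDA-using library) is imported in the process.

```python
import os

# An empty value hides every CUDA device from libraries loaded after this point.
# This must be set before torch is imported; changing it later has no effect
# on a process that has already initialised CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# After this, torch.cuda.is_available() is expected to report False,
# so a pipeline created with device=-1 cannot touch a GPU by accident.
```

The same variable can also be exported in the shell before launching the notebook server or script.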

Before we move forward with the benchmarks, you need to know one thing about batching in Hugging Face Pipelines for inference: it doesn’t always help. As stated in Hugging Face’s documentation, setting batch_size may not increase the performance of your pipeline at all. It may even slow down your pipeline:

https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching
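To picture what a batch size means in practice, here is a library-independent sketch that consumes a dataset in chunks. The `batched` helper is written for this illustration and is not part of Hugging Face; the pipeline does its own batching internally.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most `batch_size` items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 3544 images with batch size 8 means 443 forward passes instead of 3544.
num_batches = sum(1 for _ in batched(range(3544), 8))
print(num_batches)  # 443
```

Fewer forward passes only pay off when the hardware can process a whole batch in parallel, which is why the effect differs so much between CPU and GPU.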

To be fair, in my benchmarks I used a range of batch sizes starting from 1 to make sure I can find the best result among them. This is how I benchmarked the Hugging Face pipeline on CPU:

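from tqdm.auto import tqdm
from transformers import pipeline

# `model` and `feature_extractor` are loaded as shown earlier;
# `dataset` is the image dataset built in the notebooks.
pipe = pipeline("image-classification", model=model, feature_extractor=feature_extractor, device=-1)

# Try several batch sizes and let tqdm report the elapsed time for each
for batch_size in [1, 8, 32, 64, 128]:
    print("-" * 30)
    print(f"Streaming batch_size={batch_size}")
    for out in tqdm(pipe(dataset, batch_size=batch_size), total=len(dataset)):
        pass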

Let’s have a look at the results of our very first benchmark for the Hugging Face image classification pipeline on CPUs over the sample (3K) ImageNet dataset:

Hugging Face image-classification pipeline on CPUs — predicting 3544 images

As can be seen, it took around 3 minutes (188 seconds) to finish processing the 3544 images from the sample dataset. Now that I know which batch size (8) is the best for my pipeline/dataset/hardware, I can use the same pipeline over a larger dataset (34K images) with this batch size:

Hugging Face image-classification pipeline on CPUs — predicting 34745 images

This time it took around 31 minutes (1,879 seconds) to finish predicting classes for 34745 images on CPUs.
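The approach used here (try several batch sizes on the small dataset, keep the fastest, reuse it on the large dataset) can be sketched as a small timing helper. Everything in this block is illustrative: `find_best_batch_size` and `fake_run` are hypothetical names, and the sleep calls stand in for a real pass of the pipeline over the dataset.

```python
import time


def find_best_batch_size(run, batch_sizes):
    """Time `run(batch_size)` for each candidate and return (best, all timings)."""
    timings = {}
    for batch_size in batch_sizes:
        start = time.perf_counter()
        run(batch_size)
        timings[batch_size] = time.perf_counter() - start
    best = min(timings, key=timings.get)
    return best, timings


def fake_run(batch_size):
    # Stand-in workload; a real benchmark would iterate the pipeline over the dataset.
    time.sleep(0.01 if batch_size == 8 else 0.05)


best, timings = find_best_batch_size(fake_run, [1, 8, 32])
```

In a real run it is worth repeating each measurement, since a single timing can be skewed by caching or other processes on the machine.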

To speed up most deep learning models, especially these new transformer-based models, one should use accelerated hardware such as a GPU. Let’s have a look at how to benchmark the very same pipeline over the very same datasets, but this time on a GPU device. As mentioned before, we need to change the device to a CUDA device id like 0 (the first GPU):

from transformers import ViTFeatureExtractor, ViTForImageClassification
from transformers import pipeline
from tqdm.auto import tqdm
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device)

feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
model = model.to(device)

pipe = pipeline("image-classification", model=model, feature_extractor=feature_extractor, device=0)

for batch_size in [1, 8, 32, 64, 128, 256, 512, 1024]:
    print("-" * 30)
    print(f"Streaming batch_size={batch_size}")
    for out in tqdm(pipe(dataset, batch_size=batch_size), total=len(dataset)):
        pass

In addition to setting device=0, I also followed the recommended way to run a PyTorch model on a GPU device via .to(device). Since we are using accelerated hardware (GPU), I also increased the maximum batch size for my tests to 1024 to find the best result.

Let’s have a look at our Hugging Face image classification pipeline on a GPU device over the sample ImageNet dataset (3K):

Hugging Face image-classification pipeline on a GPU — predicting 3544 images

As can be seen, it took around 50 seconds to finish processing the 3544 images from our imagenet-mini-sample dataset on a GPU device. Batching improved the speed, especially compared to the results coming from the CPUs; however, the improvements stopped around a batch size of 32. Although the results are the same after batch size 32, I have chosen batch size 256 for my larger benchmark to utilize enough GPU memory as well.

Hugging Face image-classification pipeline on a GPU — predicting 34745 images

This time our benchmark took around 8:17 minutes (497 seconds) to finish predicting classes for 34745 images on a GPU device. If we compare the results from our benchmarks on CPUs and a GPU device we can see that the GPU here is the winner:

Hugging Face (PyTorch) is up to 3.9x faster on GPU vs. CPU

I used Hugging Face Pipelines to load ViT PyTorch checkpoints, load my data into a torch dataset, and use the out-of-the-box batching on both CPU and GPU. The GPU is up to ~3.9x faster compared to running the same pipelines on CPUs.

We have improved our ViT pipeline to perform image classification by using a GPU device instead of CPUs, but can we improve our pipeline further on both CPU & GPU in a single machine before scaling it out to multiple machines? Let’s have a look at the Spark NLP library.

Spark NLP: State-of-the-Art Natural Language Processing

Spark NLP is an open-source state-of-the-art Natural Language Processing library (https://github.com/JohnSnowLabs/spark-nlp)

Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. It provides simple, performant & accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. Spark NLP comes with 7000+ pretrained pipelines and models in 200+ languages. It also offers tasks such as Tokenization, Word Segmentation, Part-of-Speech Tagging, Word and Sentence Embeddings, Named Entity Recognition, Dependency Parsing, Spell Checking, Text Classification, Sentiment Analysis, Token Classification, Machine Translation (+180 languages), Summarization & Question Answering, Text Generation, Image Classification (ViT), and many more NLP tasks.

Spark NLP is the only open-source NLP library in production that offers state-of-the-art transformers such as BERT, CamemBERT, ALBERT, ELECTRA, XLNet, DistilBERT, RoBERTa, DeBERTa, XLM-RoBERTa, Longformer, ELMO, Universal Sentence Encoder, Google T5, MarianMT, GPT2, and Vision Transformer (ViT) not only to Python and R, but also to JVM ecosystem (Java, Scala, and Kotlin) at scale by extending Apache Spark natively.

Benchmarking Spark NLP on a Bare Metal Server

ViT models on a Dell PowerEdge C4130

Spark NLP has the same ViT features for Image Classification as Hugging Face, which were added in the recent 4.1.0 release. The feature is called ViTForImageClassification, it has over 240 pre-trained models ready to go, and a simple snippet to use this feature in Spark NLP looks like this:

from sparknlp.annotator import *
from sparknlp.base import *

from pyspark.ml import Pipeline

imageAssembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")
    
imageClassifier = ViTForImageClassification \
    .pretrained("image_classifier_vit_base_patch16_224") \
    .setInputCols("image_assembler") \
    .setOutputCol("class") \
    .setBatchSize(8)

pipeline = Pipeline(stages=[
    imageAssembler,
    imageClassifier
])

If we compare Spark NLP and Hugging Face side by side for downloading and loading a pre-trained ViT model for an Image Classification prediction, apart from loading images and using post calculations like argmax outside the Hugging Face library, they are both pretty straightforward. Also, they both can be saved and served later as a pipeline to reduce these lines to only 1 line of code:

Loading and using ViT models for Image Classification in Spark NLP (left) and Hugging Face (right)

Since Apache Spark uses lazy evaluation, it doesn’t start executing anything until an ACTION is called. Actions in Apache Spark include .count(), .show(), .write(), and many other RDD-based operations, which I won’t get into here and which you don’t need to know for this article. I usually either count() the target column or write() the results to disk to trigger execution over all the rows in the DataFrame. Also, as in the Hugging Face benchmarks, I will loop through selected batch sizes to make sure I don’t miss the best outcome.
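As a rough sketch of that benchmarking loop: the helper below sets a batch size, builds the lazy transformation, and then times a count() on the output column. The function name `benchmark_batch_sizes` and the "class" column default are assumptions based on the pipeline shown earlier, and it relies on `classifier` being the same ViTForImageClassification object that is already a stage of `pipeline`. The notebooks may structure this differently.

```python
import time


def benchmark_batch_sizes(classifier, pipeline, df, batch_sizes, output_col="class"):
    """Time one full pass over `df` for each batch size.

    classifier: the ViTForImageClassification stage already inside `pipeline`
    pipeline:   a pyspark.ml Pipeline
    df:         a Spark DataFrame of images
    """
    results = {}
    for batch_size in batch_sizes:
        classifier.setBatchSize(batch_size)
        start = time.perf_counter()
        predictions = pipeline.fit(df).transform(df)   # lazy: nothing runs yet
        rows = predictions.select(output_col).count()  # action: triggers execution
        results[batch_size] = (rows, time.perf_counter() - start)
    return results


# Usage, assuming `imageClassifier`, `pipeline` and an image DataFrame exist:
# results = benchmark_batch_sizes(imageClassifier, pipeline, image_df, [1, 8, 16, 32])
```

Writing the predictions to disk instead of counting them is the other option mentioned above; it also forces every row to be computed.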

Now we know how to load ViT models in Spark NLP and how to trigger an action that forces computation over all the rows in our DataFrame for benchmarking. All that is left to learn is oneDNN, the oneAPI Deep Neural Network Library. Since the DL engine in Spark NLP is TensorFlow, you can enable oneDNN to improve the speed on CPUs (like everything else, you need to test this to be sure it improves the speed and not the other way around). I will run the CPU benchmarks both with this flag enabled and without it.

Now that we know all the ViT models from Hugging Face are also available in Spark NLP and how to use them in a pipeline, we will repeat our previous benchmarks on the bare-metal Dell server to compare CPU vs. GPU. Let’s have a look at the results of Spark NLP’s image classification pipeline on CPUs over our sample (3K) ImageNet dataset:

Spark NLP image-classification pipeline on a CPU without oneDNN — predicting 3544 images

It took around 2.1 minutes (130 seconds) to finish processing the 3544 images from our sample dataset. Having a smaller dataset to try different batch sizes is helpful for choosing the right batch size for your task, your dataset, and your machine. Here it is clear that batch size 16 is the best size for our pipeline to deliver the best result.

I would also like to enable oneDNN to see if, in this specific situation, it improves my benchmark compared to the CPUs without oneDNN. You can enable oneDNN in Spark NLP by setting the environment variable TF_ENABLE_ONEDNN_OPTS to 1. Let’s see what happens if I enable this flag and re-run the previous benchmark on the CPU to find the best batch size:

Spark NLP image-classification pipeline on a CPU with oneDNN — predicting 3544 images

OK, so clearly enabling oneDNN for TensorFlow in this specific situation improved our results by at least 14%. Since we don’t have to change anything else and all it takes is to run export TF_ENABLE_ONEDNN_OPTS=1, I am going to use it for the benchmark with the larger dataset as well to see the difference. On the sample dataset the gain is only a matter of seconds, but 14% on the larger dataset can shave minutes off our results.
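If you prefer to set the flag from Python rather than the shell, a minimal sketch looks like this. It assumes a local session where the Spark JVM is launched from this Python process and therefore inherits the variable; on a cluster the variable has to be set on the nodes themselves.

```python
import os

# Assumption: the Spark JVM is started from this process (local mode), so it
# inherits the variable. Set it before the session is created.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"

# The session would be started only after the variable is set, for example:
# import sparknlp
# spark = sparknlp.start()
```

As noted above, it is worth benchmarking with and without the flag, since the gain depends on the model, the batch size, and the CPU.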

Now that I know the batch size of 16 for CPU without oneDNN and batch size of 2 for CPU with oneDNN enabled have the best results I can continue with using the same pipeline over a larger dataset (34K images):

Spark NLP image-classification pipeline on CPUs without oneDNN — predicting 34745 images

This time our benchmark took around 24 minutes (1423 seconds) to finish predicting classes for 34745 images on a CPU device without oneDNN enabled. Now let’s see what happens if I enable oneDNN for TensorFlow and use the batch size of 2 (the best results):

Spark NLP image-classification pipeline on CPUs with oneDNN — predicting 34745 images

This time it took around 21 minutes (1278 seconds). As expected from our sample benchmarks, we can see around 11% improvements in the results which did shave off minutes compared to not having oneDNN enabled.

Let’s have a look at how to benchmark the very same pipeline on a GPU device. In Spark NLP, all you need to use GPU is to start it with gpu=True when you are starting the Spark NLP session:

import sparknlp

spark = sparknlp.start(gpu=True)
# you can set the memory here as well
spark = sparknlp.start(gpu=True, memory="16g")

That’s it! If you have something in your pipeline that can be run on GPU it will do it automatically without the need to do anything explicitly.

Let’s have a look at our Spark NLP image classification pipeline on a GPU device over the sample ImageNet dataset (3K):

Spark NLP image-classification pipeline on a GPU — predicting 3544 images

Out of curiosity to see whether my crusade to find a good batch size on a smaller dataset was correct I ran the same pipeline with GPU on a larger dataset to see if the batch size 32 will have the best result:

Spark NLP image-classification pipeline on a GPU — predicting 34745 images

Thankfully, it is batch size 32 that yields the best time. So it took around 4 and a half minutes (277 seconds).

I will pick the results from CPUs with oneDNN since they were faster and I will compare them to the GPU results:

Spark NLP (TensorFlow) is up to 4.6x faster on GPU vs. CPU (oneDNN)

This is great! We can see Spark NLP on GPU is up to 4.6x faster than on CPUs, even with oneDNN enabled.

Let’s have a look at how these results are compared to Hugging Face benchmarks:

Spark NLP is 65% faster than Hugging Face on CPUs in predicting image classes for the sample dataset with 3K images and 47% faster on the larger dataset with 34K images. Spark NLP is also 79% faster than Hugging Face for single-GPU inference on the larger dataset with 34K images and up to 35% faster on the smaller dataset.

Spark NLP was faster than Hugging Face in a single machine by using either CPU or GPU — image classification by using Vision Transformer (ViT)
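The speedup factors and percentages quoted for the larger dataset follow directly from the timings reported in this section. A small worked example, using the rounded seconds from the text (the helper names are just for illustration):

```python
def speedup(slow_seconds, fast_seconds):
    """How many times faster the fast run is (e.g. 4.6 means 4.6x)."""
    return slow_seconds / fast_seconds


def percent_faster(slow_seconds, fast_seconds):
    """The same ratio expressed as a percentage gain."""
    return (slow_seconds / fast_seconds - 1) * 100


# Timings on the 34K-image dataset, in seconds, as reported above.
gpu_vs_cpu = speedup(1278, 277)        # Spark NLP: CPU with oneDNN vs. GPU
cpu_gain = percent_faster(1879, 1278)  # Hugging Face CPU vs. Spark NLP CPU (oneDNN)
gpu_gain = percent_faster(497, 277)    # Hugging Face GPU vs. Spark NLP GPU

print(f"{gpu_vs_cpu:.1f}x, {cpu_gain:.0f}%, {gpu_gain:.0f}%")  # 4.6x, 47%, 79%
```

Note that "N% faster" here means the ratio of the two running times minus one, not the percentage of time saved.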

Spark NLP & Hugging Face on Databricks

What is Databricks? All your data, analytics, and AI on one platform

Databricks is a Cloud-based platform with a set of data engineering & data science tools that are widely used by many companies to process and transform large amounts of data. Users use Databricks for many purposes from processing and transforming extensive amounts of data to running many ML/DL pipelines to explore the data.

Disclaimer: This was my interpretation of Databricks, it does come with lots of other features and you should check them out: https://www.databricks.com/product/data-lakehouse

Databricks supports AWS, Azure, and GCP clouds: https://www.databricks.com/product/data-lakehouse

Hugging Face in Databricks Single Node with CPUs on AWS

Databricks offers a “Single Node” cluster type when you are creating a cluster that is suitable for those who want to use Apache Spark with only 1 machine or use non-spark applications, especially ML and DL-based Python libraries. Hugging Face comes already installed when you choose Databricks 11.1 ML runtime. Here is what the cluster configurations look like for my Single Node Databricks (only CPUs) before we start our benchmarks:

Databricks single-node cluster — CPU runtime

The summary of this cluster that uses m5n.8xlarge instance on AWS is that it has 1 Driver (only 1 node), 128 GB of memory, 32 Cores of CPU, and it costs 5.71 DBU per hour. You can read about “DBU” on AWS here: https://www.databricks.com/product/aws-pricing

Databricks single-node cluster — AWS instance profile

Let’s replicate our benchmarks from the previous section (bare-metal Dell server) here on our single-node Databricks (CPUs only). We start with Hugging Face and our sample-sized dataset of ImageNet to find out what batch size is a good one so we can use it for the larger dataset since this happened to be a proven practice in the previous benchmarks:

Hugging Face image-classification pipeline on Databricks single-node CPUs — predicting 3544 images

It took around 2 minutes and a half (149 seconds) to finish processing around 3544 images from our sample dataset on a single-node Databricks that only uses CPUs. The best batch size on this machine using only CPUs is 8 so I am gonna use that to run the benchmark on the larger dataset:

Hugging Face image-classification pipeline on Databricks single-node CPUs — predicting 34745 images

On the larger dataset with over 34K images, it took around 20 minutes and a half (1233 seconds) to finish predicting classes for those images. For our next benchmark we need to have a single-node Databricks cluster, but this time we need to have a GPU-based runtime and choose a GPU-based AWS instance.

Hugging Face in Databricks Single Node with a GPU on AWS

Let’s create a new cluster and this time we are going to choose a runtime with GPU which in this case is called 11.1 ML (includes Apache Spark 3.3.0, GPU, Scala 2.12) and it comes with all required CUDA and NVIDIA software installed. The next thing we need is to also select an AWS instance that has a GPU and I have chosen g4dn.8xlarge that has 1 GPU and a similar number of cores/memory as the other cluster. This GPU instance comes with a Tesla T4 and 16 GB memory (15 GB usable GPU memory).

Databricks single-node cluster — GPU runtime

This is the summary of our single-node cluster like the previous one and it is the same in terms of the number of cores and the amount of memory, but it comes with a Tesla T4 GPU:

Databricks single-node cluster — AWS instance profile

Now that we have a single-node cluster with a GPU we can continue our benchmarks to see how Hugging Face performs on this machine in Databricks. I am going to run the benchmark on the smaller dataset to see which batch size is more suited for our GPU-based machine:

Hugging Face image-classification pipeline on Databricks single-node GPU — predicting 3544 images

It took around a minute (64 seconds) to finish processing around 3544 images from our sample dataset on our single-node Databricks cluster with a GPU device. The batching improved the speed if we look at batch size 1 result, however, after batch size 8 the results pretty much stayed the same. Although the results are the same after batch size 8, I have chosen batch size 256 for my larger benchmark to utilize more GPU memory as well. (to be honest, 8 and 256 both performed pretty much the same)

Let’s run the benchmark on the larger dataset and see what happens with batch size 256:

Hugging Face image-classification pipeline on Databricks single-node GPU — predicting 34745 images

On a larger dataset, it took almost 11 minutes (659 seconds) to finish predicting classes for over 34K images. If we compare the results from our benchmarks on a single node with CPUs and a single node that comes with 1 GPU we can see that the GPU node here is the winner:

Hugging Face (PyTorch) is up to 2.3x faster on GPU vs. CPU

The GPU is up to ~2.3x faster compared to running the same pipeline on CPUs in Hugging Face on Databricks Single Node.

Now we are going to run the same benchmarks by using Spark NLP in the same clusters and over the same datasets to compare it with Hugging Face.

Benchmarking Spark NLP on a Single Node Databricks

First, let’s install Spark NLP in your Single Node Databricks CPUs:

In the Libraries tab inside your cluster you need to follow these steps:
— Install New -> PyPI -> spark-nlp==4.1.0 -> Install
— Install New -> Maven -> Coordinates -> com.johnsnowlabs.nlp:spark-nlp_2.12:4.1.0 -> Install
— Add `TF_ENABLE_ONEDNN_OPTS=1` to `Cluster->Advanced Options->Spark->Environment variables` to enable oneDNN

How to install Spark NLP in Databricks on CPUs for Python, Scala, and Java

Spark NLP in Databricks Single Node with CPUs on AWS

Now that we have Spark NLP installed on our Databricks single-node cluster we can repeat the benchmarks for a sample and full datasets on both CPU and GPU. Let’s start with the benchmark on CPUs first over the sample dataset:

Spark NLP image-classification pipeline on Databricks single-node CPUs (oneDNN) — predicting 3544 images

It took around 2 minutes (111 seconds) to finish processing 3544 images and predicting their classes on the same single-node Databricks cluster with CPUs we used for Hugging Face. We can see that the batch size of 16 has the best result so I will use this in the next benchmark on the larger dataset:

Spark NLP image-classification pipeline on Databricks single-node CPUs (oneDNN) — predicting 34742 images

On the larger dataset with over 34K images, it took around 18 minutes (1072 seconds) to finish predicting classes for those images. Next up, I will repeat the same benchmarks on the cluster with GPU.

Databricks Single Node with a GPU on AWS

First, install Spark NLP in your Single Node Databricks GPU (the only difference is the use of “spark-nlp-gpu” from Maven):

Install Spark NLP in your Databricks cluster
— In the Libraries tab inside the cluster you need to follow these steps:
— Install New -> PyPI -> spark-nlp==4.1.0 -> Install
— Install New -> Maven -> Coordinates -> com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.1.0 -> Install

How to install Spark NLP in Databricks on GPUs for Python, Scala, and Java

I am going to run the benchmark on the smaller dataset to see which batch size is more suited for our GPU-based machine:

Spark NLP image-classification pipeline on Databricks single-node GPU — predicting 3544 images

It took less than a minute (47 seconds) to finish processing around 3544 images from our sample dataset on our single-node Databricks with a GPU device. We can see that batch size 8 performed the best in this specific use case so I will run the benchmark on the larger dataset:

Spark NLP image-classification pipeline on Databricks single-node GPU — predicting 34742 images

On a larger dataset, it took almost 7 minutes and a half (435 seconds) to finish predicting classes for over 34K images. If we compare the results from our benchmarks on a single node with CPUs and a single node that comes with 1 GPU we can see that the GPU node here is the winner:

Spark NLP is up to 2.5x faster on GPU vs. CPU in Databricks Single Node

This is great! We can see Spark NLP on GPU is up to 2.5x faster than on CPUs, even with oneDNN enabled (oneDNN improves results on CPUs by between 10% and 20%).

Let’s have a look at how these results are compared to Hugging Face benchmarks in the same Databricks Single Node cluster:

Spark NLP is up to 34% faster than Hugging Face on CPUs in predicting image classes for the sample dataset with 3K images (149 vs. 111 seconds) and up to 15% faster on the larger dataset with 34K images (1233 vs. 1072 seconds). Spark NLP is also 51% faster than Hugging Face on a single GPU for the larger dataset with 34K images and up to 36% faster on the smaller dataset with 3K images.

Spark NLP is faster on both CPUs and GPUs vs. Hugging Face in Databricks Single Node

Scaling beyond a single machine

So far we have established that Hugging Face on GPU is faster than Hugging Face on CPUs, both on a bare-metal server and on Databricks Single Node. This is what you expect when you compare GPU vs. CPU with these new transformer-based models.

We have also established that Spark NLP outperforms Hugging Face for the very same pipeline (ViT model), on the very same datasets, on both the bare-metal server and the Databricks single-node cluster, and on both CPU and GPU devices. This, on the other hand, was not something I expected. When I was preparing this article, I expected TensorFlow inference in Spark NLP to be slightly slower than PyTorch inference in Hugging Face, or at best neck and neck. The section I was really aiming for was this one: scaling the pipeline beyond a single machine. But it seems Spark NLP is faster than Hugging Face even on a single machine, on both CPU and GPU, over both small and large datasets.

Question: What if you want to make your ViT pipeline even faster? What if you have even larger datasets and you just cannot fit them inside one machine or it just takes too long to get the results back?

Answer: Scaling out! This means that instead of resizing the same machine, you add more machines to your cluster. You need something to manage all those jobs, tasks, scheduling DAGs, failed tasks, etc., and those have their overheads, but if you need something to be faster or to be possible at all (beyond a single machine) you have to use some sort of distributed system.

Scaling up = making your machine bigger or faster so that it can handle more load.
Scaling out = adding more machines in parallel to spread out a load.

Scaling out Hugging Face:

The performance page on Hugging Face’s official website suggests that scaling inference is only possible by using multiple GPUs. By the definition of scaling out given above, this is still stuck in a single machine:

https://huggingface.co/docs/transformers/performance

On top of that, the multi-GPU solution for inference in Hugging Face doesn’t exist at the moment:

https://huggingface.co/docs/transformers/perf_infer_gpu_many

So it seems there is no native/official way to scale out Hugging Face pipelines. You can implement your own architecture consisting of microservices such as a job queue, messaging protocols, a RESTful API backend, and other required components to distribute each request over different machines, but this scales the requests made by individual users instead of scaling out the actual system itself.

In addition, the latency of such systems is not comparable with natively distributed systems such as Apache Spark (gRPC might lower this latency, but still not competitive). Not to mention the single point of failure issue, managing failed jobs/tasks/inputs, and hundreds of other features you get out-of-the-box from Apache Spark that now you have to implement/maintain by yourself.

There is a blog post on the Hugging Face website portraying the very same architecture by scaling REST endpoints to serve more users: “Deploying 🤗 ViT on Kubernetes with TF Serving”. I believe other companies are using similar approaches to scale out Hugging Face. However, they are all scaling the number of users/requests hitting the inference REST endpoints. In addition, you cannot scale Hugging Face this way on Databricks.

For instance, inference inside FastAPI is 10x slower than local inference: https://towardsdatascience.com/hugging-face-transformer-inference-under-1-millisecond-latency-e1be0057a51c

Once Hugging Face offers a native solution to scale out, I will re-run the benchmarks. Until then, there is no scaling out when you have to loop through the dataset from a single machine and hit REST endpoints in round-robin fashion. (Think again about the part where we batched rows/sequences/images to feed the GPU all at once, and you’ll get it.)
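To make the round-robin point concrete, here is a minimal sketch of what such a dispatcher boils down to. It makes no real HTTP calls; the function name and the endpoint strings are hypothetical and only illustrate the shape of the loop.

```python
from itertools import cycle
from typing import Dict, Iterable, List


def round_robin_assign(items: Iterable[str], endpoints: List[str]) -> Dict[str, List[str]]:
    """Hand out items to endpoints one at a time, in rotation."""
    assignment = {endpoint: [] for endpoint in endpoints}
    # A single loop on a single machine touches every item:
    # the dispatcher itself is never distributed.
    for item, endpoint in zip(items, cycle(endpoints)):
        assignment[endpoint].append(item)
    return assignment
```

However many endpoints sit behind it, every image still passes through this one loop, and each request carries a single item rather than a full batch for the GPU.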

Scaling out Spark NLP:

Spark NLP is an extension of Spark ML, so it scales natively and seamlessly over all platforms supported by Apache Spark, such as (but not limited to) Databricks, AWS EMR, Azure HDInsight, GCP Dataproc, Cloudera, SageMaker, Kubernetes, and many more.

Zero code changes are needed! Spark NLP can scale from a single machine to an infinite number of machines without changing anything in the code!

You also don’t need to export any models out of Spark NLP to use it in an entirely different library to speed up or scale the inference.
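For reference, this is roughly what the pipeline that stays unchanged looks like. Treat it as a sketch rather than the exact notebook code: the function wrapper, the default batch size of 8 (the best value from the single-node GPU run) and the model name `image_classifier_vit_base_patch16_224` are my assumptions. The imports sit inside the function so the snippet can be loaded even where Spark NLP is not installed.

```python
def build_vit_pipeline(batch_size: int = 8,
                       model_name: str = "image_classifier_vit_base_patch16_224"):
    """Build a Spark ML Pipeline with a Spark NLP ViT image classifier (sketch)."""
    # Lazy imports: only needed when the pipeline is actually built on a Spark cluster.
    from pyspark.ml import Pipeline
    from sparknlp.annotator import ViTForImageClassification
    from sparknlp.base import ImageAssembler

    # Turns Spark's image struct column into the annotation format Spark NLP expects.
    image_assembler = (ImageAssembler()
                       .setInputCol("image")
                       .setOutputCol("image_assembler"))

    # model_name is an assumption; swap in whichever pretrained ViT you benchmark.
    image_classifier = (ViTForImageClassification.pretrained(model_name)
                        .setInputCols(["image_assembler"])
                        .setOutputCol("class")
                        .setBatchSize(batch_size))

    return Pipeline(stages=[image_assembler, image_classifier])
```

On a cluster you would read the images with `spark.read.format("image").load(path)` and call `build_vit_pipeline().fit(df).transform(df)`. Nothing in that code mentions how many nodes exist; Spark decides where the partitions run.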

Spark NLP ecosystem: optimized, tested, and supported integrations

Databricks Multi-Node with CPUs on AWS

Let’s create a cluster, and this time we choose Standard as the Cluster mode. This means we can have more than 1 node in our cluster, which in Apache Spark terminology means 1 Driver and N Workers (Executors).

We also need to install Spark NLP in this new cluster via the Libraries tab. You can follow the steps I mentioned in the previous section for Single Node Databricks with CPUs. As you can see, I have chosen the same CPU-based AWS instance I used to benchmark both Hugging Face and Spark NLP so we can see how it scales out when we add more nodes.

This is what our Cluster configurations look like:

Databricks multi-node (standard) cluster with only CPUs

I will reuse the same Spark NLP pipeline I used in previous benchmarks (no need to change any code) and also I will only use the larger dataset with 34K images. Let’s begin!

Scale Spark NLP on CPUs with 2x nodes

Databricks with 2x Nodes — CPUs only

Let’s just add 1 more node and bring the total number of machines doing the processing to 2. Let’s not forget the beauty of Spark NLP: when you go from a single-machine setup (your Colab, Kaggle, Databricks Single Node, or even your local Jupyter notebook) to a multi-node cluster setup (Databricks, EMR, GCP, Azure, Cloudera, YARN, Kubernetes, etc.), zero code change is required! And I mean zero! With that in mind, I will run the same benchmark inside this new cluster on the larger dataset with 34K images:

Spark NLP image-classification pipeline on 2x nodes with CPUs (oneDNN) — predicting 34742 images

It took around 9 minutes (550 seconds) to finish predicting classes for 34K images. Let’s compare this result on 2x Nodes with Spark NLP and Hugging Face results on Databricks single node (I will keep repeating the Hugging Face results on a Single Node as a reference since Hugging Face could not be scaled out on multiple machines, especially on Databricks):

Spark NLP is 124% faster than Hugging Face with 2x Nodes

Previously, Spark NLP beat Hugging Face on a Single Node Databricks cluster by using only CPUs by 15%.

This time, by having only 2x nodes instead of 1 node, Spark NLP finished processing over 34K images 124% faster than Hugging Face.

Scale Spark NLP on CPUs with 4x nodes

Let’s double the size of our cluster like before and go from 2x Nodes to 4x Nodes. This is what the cluster looks like with 4x nodes:

Databricks with 4x Nodes — CPUs only

I will run the same benchmark on this new cluster on the larger dataset with 34K images:

Spark NLP image-classification pipeline on 4x nodes with CPUs (oneDNN) — predicting 34742 images

It took around 5 minutes (289 seconds) to finish predicting classes for 34K images. Let’s compare this result on 4x Nodes with Spark NLP vs. Hugging Face on CPUs on Databricks:

Spark NLP is 327% faster than Hugging Face with 4x Nodes

As can be seen, Spark NLP is now 327% faster than Hugging Face on CPUs while using only 4x Nodes in Databricks.

Scale Spark NLP on CPUs with 8x nodes

Now let’s double the previous cluster by adding 4 more Nodes for a total of 8x Nodes. Resizing the cluster, by the way, is pretty easy: you just increase the number of workers in your cluster configuration:

Resizing Spark Cluster in Databricks
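If you prefer to resize from a script instead of the UI, the Databricks Clusters API has a resize endpoint (`POST /api/2.0/clusters/resize`). A minimal sketch of the request body follows; the helper name and the cluster ID are placeholders, and how many workers correspond to each "Nx Nodes" setup in this article is for you to map to your own cluster.

```python
import json


def resize_payload(cluster_id: str, num_workers: int) -> str:
    """Build the JSON body for POST /api/2.0/clusters/resize (hypothetical helper)."""
    if num_workers < 0:
        raise ValueError("num_workers must be zero or greater")
    return json.dumps({"cluster_id": cluster_id, "num_workers": num_workers})


# "0000-000000-example" is a placeholder, not a real cluster ID.
print(resize_payload("0000-000000-example", 8))
```

The notebook code does not change between sizes; only `num_workers` does.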

Databricks with 8x Nodes — CPUs only

Let’s run the same benchmark this time on 8x Nodes:

Spark NLP image-classification pipeline on 8x nodes with CPUs (oneDNN) — predicting 34742 images

It took just over two and a half minutes (161 seconds) to finish predicting classes for 34K images. Let’s compare this result on 8x Nodes with Spark NLP vs. Hugging Face on CPUs on Databricks:

Spark NLP is 666% faster than Hugging Face with 8x Nodes

As can be seen, Spark NLP is now 666% faster than Hugging Face on CPUs while using only 8x Nodes in Databricks.

Let’s just ignore the number of 6s here! (it was 665.8% if it makes you feel better)

Scale Spark NLP on CPUs with 10x nodes

To finish scaling out our ViT model predictions on CPUs in Databricks with Spark NLP, I will resize the cluster one more time and increase it to 10x Nodes:

Databricks with 10x Nodes — CPUs only

Let’s run the same benchmark this time on 10x Nodes:

Spark NLP image-classification pipeline on 10x nodes with CPUs (oneDNN) — predicting 34742 images

It took less than 2 minutes (112 seconds) to finish predicting classes for 34K images. Let’s compare this result on 10x Nodes with all the previous results from Spark NLP vs. Hugging Face on CPUs on Databricks:

Spark NLP is 1000% faster than Hugging Face with 10x Nodes

And this is how you scale out the Vision Transformer model coming from Hugging Face on 10x Nodes by using Spark NLP in Databricks! Our pipeline now is 1000% faster than Hugging Face on CPUs.

We managed to make our ViT pipeline 1000% faster than Hugging Face, which is stuck in a single node, by simply using Spark NLP. But we only used CPUs. Let’s see if we can get the same improvements by scaling out our pipeline on a GPU cluster.
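One way to check how well the CPU cluster scales is to compare each run against the 2x Nodes run. The sketch below uses the rounded wall-clock times quoted above (550, 289, 161 and 112 seconds); the efficiency definition (measured speedup divided by the increase in node count) is the usual textbook one and is my addition, not something computed in the notebooks.

```python
def scaling_efficiency(seconds_by_nodes: dict, base_nodes: int) -> dict:
    """Return {nodes: (speedup vs. base run, parallel efficiency)}, rounded to 2 decimals."""
    base_seconds = seconds_by_nodes[base_nodes]
    result = {}
    for nodes, seconds in sorted(seconds_by_nodes.items()):
        speedup = base_seconds / seconds   # how much faster than the base run
        ideal = nodes / base_nodes         # perfect linear scaling
        result[nodes] = (round(speedup, 2), round(speedup / ideal, 2))
    return result


# Rounded times (seconds) from the CPU runs on the 34K-image dataset.
cpu_seconds = {2: 550, 4: 289, 8: 161, 10: 112}

for nodes, (speedup, efficiency) in scaling_efficiency(cpu_seconds, base_nodes=2).items():
    print(f"{nodes:>2} nodes: {speedup}x vs. 2 nodes, efficiency {efficiency}")
```

An efficiency close to 1.0 means the extra nodes are almost fully used. Since the inputs are rounded to whole seconds, small differences between runs should not be over-read.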

Databricks Multi-Node with GPUs on AWS

Having a GPU-based multi-node Databricks cluster is pretty much the same as having a single-node cluster. The only difference is choosing Standard and keeping the same ML/GPU Runtime with the same AWS Instance specs we chose in our benchmarks for GPU on a single node.

We also need to install Spark NLP in this new cluster via the Libraries tab. Same as before, you can follow the steps I mentioned in Single Node Databricks with a GPU.

Databricks multi-node (standard) cluster with GPUs

Scale Spark NLP on GPUs with 2x nodes

Our multi-node Databricks GPU cluster uses the same AWS GPU instance of g4dn.8xlarge that we used previously to run our benchmarks to compare Spark NLP vs. Hugging Face on a single-node Databricks cluster.

This is a summary of what it looks like this time with 2 nodes:

Databricks with 2x Nodes — with 1 GPU per node

I am going to run the same pipeline in this GPU cluster with 2x nodes:

Spark NLP image-classification pipeline on 2x nodes with GPUs — predicting 34742 images

It took just under 4 minutes (231 seconds) to finish predicting classes for 34K images. Let’s compare this result on 2x Nodes with Spark NLP vs. Hugging Face on GPUs in Databricks:

Spark NLP is 185% faster than Hugging Face with 2x Nodes

Spark NLP with 2x Nodes is almost 3x faster (185%) than Hugging Face on a single node while using GPUs.

Scale Spark NLP on GPUs with 4x nodes

Let’s resize our GPU cluster from 2x Nodes to 4x Nodes. This is a summary of what it looks like this time with 4x Nodes using a GPU:

Databricks with 4x Nodes — with 1 GPU per node

Let’s run the same benchmark on 4x Nodes and see what happens:

Spark NLP image-classification pipeline on 4x nodes with GPUs — predicting 34742 images

This time it took almost 2 minutes (118 seconds) to finish classifying all 34K images in our dataset. Let’s visualize this to get a better view of how Hugging Face on a single node compares with Spark NLP on a multi-node cluster:

Spark NLP is 458% faster than Hugging Face with 4x Nodes

That’s a 458% performance increase compared to Hugging Face. We just made our pipeline 5.6x faster by using Spark NLP with 4x nodes.
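Since the article quotes both "% faster" and "x faster", here is how the two relate. The formula is inferred from the pairs quoted in the text (458% alongside 5.6x, 185% alongside almost 3x): "N% faster" is read as the baseline time divided by the new time, minus one, times 100.

```python
def percent_faster(baseline_seconds: float, new_seconds: float) -> float:
    """How much faster the new run is than the baseline, in percent."""
    return (baseline_seconds / new_seconds - 1.0) * 100.0


def times_faster(percent: float) -> float:
    """Convert 'N% faster' into an 'x faster' multiplier."""
    return percent / 100.0 + 1.0


print(round(times_faster(458), 2))  # 5.58, quoted as 5.6x above
```

So a run that is 100% faster takes half the time, and 458% faster means the same job finishes in a bit under one fifth of the baseline time.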

Scale Spark NLP on GPUs with 8x nodes

Next, I will resize the cluster to have 8x Nodes in my Databricks with the following summary:

Databricks with 8x Nodes — with 1 GPU per node

Just as a reminder, each AWS instance (g4dn.8xlarge) has 1 NVIDIA T4 GPU with 16GB of memory (15GB usable). Let’s re-run the benchmark and see if we can spot any improvements, since scaling out in any distributed system has its overheads and you cannot just keep on adding machines:

Spark NLP image-classification pipeline on 8x nodes with GPUs — predicting 34742 images

It took about a minute (61 seconds) to finish classifying 34K images with 8x Nodes in our Databricks cluster. It seems we still managed to improve the performance. Let’s put this result next to previous results from Hugging Face in a single node vs. Spark NLP in a multi-node cluster:

Spark NLP is 980% faster than Hugging Face with 8x Nodes

Spark NLP with 8x Nodes is almost 11x faster (980%) than Hugging Face on GPUs.

Scale Spark NLP on GPUs with 10x nodes

Similar to our multi-node benchmarks on CPUs I would like to resize the GPU cluster one more time to have 10x Nodes and match them in terms of the final number of nodes. The final summary of this cluster is as follows:

Databricks with 10x Nodes — with 1 GPU per node

Let’s run our very last benchmark in this specific GPU cluster (with zero code changes):

Spark NLP image-classification pipeline on 10x nodes with GPUs — predicting 34742 images

It took less than a minute (51 seconds) to finish predicting classes for over 34K images. Let’s put them all next to each other and see how we progressed scaling out our Vision Transformer model coming from Hugging Face in the Spark NLP pipeline in Databricks:

Spark NLP is 1200% faster than Hugging Face with 10x Nodes

And we are done!

We managed to scale out our Vision Transformer model coming from Hugging Face on 10x Nodes by using Spark NLP in Databricks! Our pipeline is now 13x faster, a 1200% performance improvement compared to Hugging Face on GPU.

Let’s sum up all these benchmarks by first comparing the improvements between CPUs and GPUs, and then how much faster our pipeline can be by going from Hugging Face on CPUs to 10x Nodes on Databricks by using Spark NLP on GPUs.
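Another way to read the GPU runs is as throughput. The sketch below turns the rounded wall-clock times quoted above (231, 118, 61 and 51 seconds for 34742 images) into images per second, in total and per node. The per-node split simply divides by the node count in each heading, which is my simplification.

```python
IMAGES = 34742  # size of the larger dataset used in every multi-node run

# Rounded times (seconds) from the GPU runs above, keyed by number of nodes.
gpu_seconds = {2: 231, 4: 118, 8: 61, 10: 51}


def throughput(images: int, seconds: float) -> float:
    """Images classified per second of wall-clock time."""
    return images / seconds


for nodes, seconds in sorted(gpu_seconds.items()):
    total = throughput(IMAGES, seconds)
    print(f"{nodes:>2} nodes: {total:.1f} images/s total, {total / nodes:.1f} images/s per node")
```

If per-node throughput drops as nodes are added, that is the distribution overhead mentioned earlier showing up in the numbers.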

Bringing it all together:

Databricks: Single Node & Multi Nodes

Spark NLP 🚀 on 10x Nodes with CPUs is 1000% (11x times) faster than Hugging Face 🤗 stuck in a single node with CPUs

Spark NLP 🚀 on 10x Nodes with GPUs is 1192% (13x times) faster than Hugging Face 🤗 stuck in a single node with GPU

What about the price differences between our AWS CPU instance and AWS GPU instance? (I mean, you get more if you pay more, right?)

AWS m5d.8xlarge with CPUs vs. AWS g4dn.8xlarge with 1 GPU and similar specs

OK, so the price seems pretty much the same! With that in mind, what improvements do you get if you move from Hugging Face on CPUs stuck in a single machine to Spark NLP on 10x Nodes with 10x GPUs?

Spark NLP on GPUs is 25x times (2366%) faster than Hugging Face on CPUs

Spark NLP 🚀 on 10x Nodes with GPUs is 2366% (25x times) faster than Hugging Face 🤗 in a single node with CPUs

Final words

In the spirit of full transparency, all the notebooks with their logs, screenshots, and even the Excel sheet with the numbers are provided here on GitHub.
Scaling Spark NLP requires zero code change. Going from a single-node Databricks cluster to 10 nodes meant just re-running the same block of code in the same notebook.
Keep in mind these two libraries come with many best practices to optimize their speed and efficiency in different environments for different use cases. For instance, I didn’t talk about partitions and their relation to parallelism and distributions in Apache Spark. There are many Spark configs to fine-tune a cluster, especially balancing the number of tasks between CPUs and GPUs. Now the question is, would it be possible to speed up any of them within the very same environments we used for our benchmarks? The answer is 100%! I tried to keep everything for both libraries with default values and right out-of-the-box features in favor of simplicity for the majority of the users.
You may want to wrap Hugging Face and other DL-based Pythonish libraries in a Spark UDF to scale them. This works to a degree as I have done this myself and still do (when there is no native solution). I won’t get into the details of excessive memory usage, possible serialization issues, higher latency, and other problems when one wraps such transformer-based models in a UDF. I would just say if you are using Apache Spark use the library that is natively extending your required features on Apache Spark.
Throughout this article, I went out of my way to mention Hugging Face on PyTorch and Spark NLP on TensorFlow. This is a big difference given the fact that in every single benchmark done by Hugging Face between PyTorch and TensorFlow, PyTorch was and still is the winner for inference. In Hugging Face, PyTorch just has a much lower latency and it seems to be much faster than TensorFlow in Transformers. The fact that Spark NLP uses the very same TensorFlow and comes out ahead in every benchmark compared to PyTorch in Hugging Face is a big deal. Either the TensorFlow in Hugging Face is neglected, or PyTorch is just faster in inference compared to TensorFlow. Either way, I can’t wait to see what happens when Spark NLP starts supporting TorchScript and ONNX Runtime in addition to TensorFlow.
The ML and ML GPU Databricks runtimes come with Hugging Face installed, and that’s pretty nice. But that doesn’t mean Hugging Face is easy to use in Databricks. The Transformers library by Hugging Face doesn’t support DBFS (the native distributed file system of Databricks) or Amazon S3. As you can see in the notebooks, I had to download a compressed version of the datasets and extract them before using them. That’s not really how users of Databricks and other platforms do things in production. We keep our data within distributed file systems, there are security measures in place, and most datasets are too large to be downloaded to a personal computer. I had to download the datasets I already had on DBFS, zip them, upload them to S3, make them public, and download them again in the notebooks. A pretty tedious process that could have been avoided if Hugging Face supported DBFS/S3.
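On the UDF point above: if you do end up wrapping a Python model yourself, the usual way to keep memory and latency in check is to load the model once per partition and feed it batches, for example through `rdd.mapPartitions`. The sketch below is plain Python with no Spark or Hugging Face imports; `load_model` is a hypothetical factory returning a callable that maps a batch of inputs to a list of labels.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple


def predict_partition(rows: Iterable[str],
                      load_model: Callable[[], Callable[[List[str]], List[str]]],
                      batch_size: int = 8) -> Iterator[Tuple[str, str]]:
    """Yield (input, label) pairs for one partition, predicting in batches."""
    model = load_model()  # loaded once per partition, not once per row
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        # The model sees a whole batch at a time, like the batched GPU runs above.
        for item, label in zip(batch, model(batch)):
            yield item, label
```

With Spark this would be passed as `rdd.mapPartitions(lambda rows: predict_partition(rows, load_model))`. It still inherits the serialization and memory caveats mentioned above, which is why a library that extends Spark natively is the simpler route.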
