This Open-Source Library Accelerates AI Inference by 5-20x in a Few Lines of Code

How does nebullvm work?

It takes your AI model as input and outputs an optimized version that runs 5-20 times faster on your hardware. In other words, nebullvm tests multiple deep learning compilers to identify the best possible way to execute your model on your specific machine, without impacting the accuracy of your model.

And that's it. In just a few lines of code.

And a big thank you to everyone for supporting this open-source project! The library received 250+ GitHub stars⭐ on release day, and that's just amazing 🚀

Orientation Map

Let's learn more about nebullvm and AI optimization. Where should we start? From...

Some CONTEXT on why few developers optimize AI and related negative consequences
An overview of how the LIBRARY works
Some USE CASES, technology demonstrations and benchmarks
A description of the TECHNOLOGY behind the library

Or let's jump straight to the library → nebullvm

Context

The adoption of Artificial Intelligence (AI) is growing rapidly, although we are still far from exploiting the full potential of this technology.

Indeed, what typically happens is that AI developers spend most of their time on data analysis, data cleaning, and model testing/training with the objective of building very accurate AI models.

Yet... few models make it into production. If they do, two situations arise:

AI models are developed by skilled data scientists and great AI engineers, who often have limited experience with cloud, compilers, hardware, and all the low-level matters. When their models are ready to be deployed, they select the first GPU or CPU they can think of on the cloud or their company/university server, unaware of the severe impact on model performance (i.e. much slower and more expensive computing) caused by uninformed hardware selection, poor cloud infrastructure configuration, and lack of model/hardware post-training optimization.

Other companies have developed in-house AI models that work robustly. AI inference is critical to these companies, so they often build a team of hardware/cloud engineers who spend hours looking for out-of-the-box methods to optimize model deployment.

Do you fall into one of these two groups? Then you might be interested in the nebullvm library, and below we explain why.

Library

How does nebullvm work?

You import the library, nebullvm does some magic, and your AI model will run 5-20 times faster.

And that's it. In just a few lines of code.

The goal of the nebullvm library is to let any developer benefit from deep learning compilers without having to waste tons of hours understanding, installing, testing and debugging this powerful technology.

Nebullvm is quickly becoming popular, with 250+ GitHub stars on release day and hundreds of active users from both startups and large tech companies. The library aims to be:

💻 Deep learning model agnostic. nebullvm supports all the most popular architectures such as transformers, LSTMs, CNNs and FCNs.

🤖 Hardware agnostic. The library now works on most CPUs and GPUs and will soon support TPUs and other deep learning-specific ASICs.

🔥 Framework agnostic. nebullvm supports the most widely used frameworks (PyTorch, TensorFlow and Hugging Face) and will soon support many more.

🔑 Secure. Everything runs locally on your machine.

☘️ Easy-to-use. It takes a few lines of code to install the library and optimize your models.

✨ Leveraging the best deep learning compilers. There are tons of DL compilers that optimize the way your AI models run on your hardware. It would take tons of hours for a developer to install and test them at every model deployment. The library does it for you!

Use cases

Why is accelerating computing by 5-20x so valuable?

To save time → Accelerate your AI services and make them real-time.

To save money → Reduce cloud computing costs.

To save energy → Reduce the electricity consumption and carbon footprint of your AI services.

You can probably already see how accelerated computing would benefit your specific use case. Here are some examples of how nebullvm is helping the community across different sectors:

Fast computing makes search and recommendation engines faster, which leads to a more enjoyable user experience on websites and platforms. Besides, near real-time AI is a strict requirement for many healthtech companies and for autonomous driving, where slow response times can put people's lives in danger. The metaverse and the gaming industry also require near-zero latency to allow people to interact seamlessly. Speed can also provide an edge in sectors such as crypto/NFT/fast trading.

Lowering costs with minimal effort never hurts anyone. There is little to explain about this.

Green AI is a topic that is becoming more popular over time. Everyone is well aware of the risks and implications of climate change and it is important to reduce energy consumption where possible. Widespread awareness of the issue is reflected in how purchasing behavior across sectors is moving toward greater sustainability. In addition, low power consumption is a system requirement in some cases, especially on IoT/edge devices that may not be connected to continuous power sources.

Technology Demonstration

We suggest testing the library on your AI model right away by following the installation instructions on GitHub. If instead you want to get a hands-on sense of the library's capabilities, check out the notebooks at this link where you can test nebullvm on popular deep learning models. Note that the notebooks still require you to install the library, just as you would to test nebullvm on your own models, and installation takes several minutes. Once it's installed, nebullvm will optimize your models in a short time.

Benchmarks

We have also tested nebullvm on popular AI models and hardware from leading vendors.

Hardware: M1 Pro, NVIDIA T4, Intel Xeon, AMD EPYC
AI Models: EfficientNet, ResNet, SqueezeNet, BERT, GPT-2

At first glance, we can observe that acceleration varies greatly across hardware-model couplings. Overall, the library provides great positive results, most ranging from 2 to 30 times speedup.

To summarize, the results are:

Nebullvm provides positive acceleration to non-optimized AI models

Early results show poorer (yet positive) performance on Hugging Face models. Support for Hugging Face has just been released and improvements will be implemented in future versions
Nebullvm provides a ~2-3x boost on Intel hardware. These results are most likely related to an already highly optimized implementation of PyTorch for Intel devices
Extremely good performance on NVIDIA machines
The library also provides great performance on Apple M1 chips
And across all scenarios, nebullvm is very useful for its ease of use, allowing you to take advantage of deep learning compilers without having to spend hours studying, testing and debugging this technology

The table below shows the response time in milliseconds (ms) of the non-optimized model and the optimized model for the various model-hardware couplings as an average value over 100 experiments. It also displays the speedup provided by nebullvm, where speedup is defined as the response time of the non-optimized model divided by the response time of the optimized model, so a value above 1 means the optimized model is faster.
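As a rough illustration of how such figures can be produced, the sketch below averages wall-clock latency over a number of runs and derives a speedup from two averages. Speedup is taken here as baseline latency divided by optimized latency, so values above 1 mean the optimized model is faster. The helper names `mean_latency_ms` and `speedup` are hypothetical and are not part of nebullvm; this is not the benchmarking code used for the results above.

```python
import time
from statistics import mean


def mean_latency_ms(fn, sample_input, runs=100):
    """Average wall-clock time of fn(sample_input), in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(sample_input)
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)


def speedup(baseline_ms, optimized_ms):
    """Baseline latency divided by optimized latency (> 1 means faster)."""
    if optimized_ms <= 0:
        raise ValueError("optimized latency must be positive")
    return baseline_ms / optimized_ms
```

For example, a model that goes from an average of 120 ms to 12 ms per prediction has a speedup of 10x under this definition.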

Hardware used for the experiment is the following:

M1 Pro → Apple M1 Pro 16GB of RAM
Intel Xeon → EC2 Instance on AWS - t2.large
AMD EPYC → EC2 Instance on AWS - t4a.large
NVIDIA T4 → EC2 Instance on AWS - g4dn.xlarge

Technology

Nebullvm leverages the best deep learning compilers to accelerate AI models in inference.

So what exactly are deep learning compilers?

A deep learning compiler takes your model as input and produces an efficient version of it that runs the model computation graph faster on specific hardware.

How?

There are several methods that, in principle, all attempt to rearrange the computations of neural networks to make better use of the hardware memory layout and optimize hardware utilization.

In very simplistic terms, deep learning optimization can be achieved by optimizing the entire end-to-end computation graph, as well as by restructuring operators (mainly for loops related to matrix multiplications) within the graph [1, 2]. Here are some examples of optimization techniques:

Operator fusion. It refers to the process where a sequence of operators eligible for fusion is first identified and then replaced with a corresponding handwritten implementation. Fusing operators allows for better sharing of computation and removal of intermediate allocations, and facilitates further optimization by combining loop nests. [3]
Quantization. It refers to techniques for performing computations and storing tensors at lower bitwidths than floating point precision. A quantized model executes some or all of the operations on tensors with integers rather than floating point values. [4, 5]
Graph pruning. Pruning refers to removing certain parameters in the neural network because they are redundant and do not contribute significantly to the output, resulting in a smaller, faster network. [6]
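To make these three ideas more concrete, here is a toy sketch on plain Python lists. It is only meant to show the principle: real compilers apply these transformations to tensors and computation graphs, and none of the function names below come from nebullvm or from any specific compiler. The quantization scheme shown (symmetric, one scale per tensor, int8 range) and the pruning criterion (smallest magnitude first) are assumptions chosen for simplicity.

```python
def scale_add_relu_unfused(xs, w, b):
    """Three separate operators, each producing an intermediate list."""
    scaled = [w * x for x in xs]
    shifted = [s + b for s in scaled]
    return [max(0.0, s) for s in shifted]


def scale_add_relu_fused(xs, w, b):
    """Operator fusion: same result in one loop, no intermediate lists."""
    return [max(0.0, w * x + b) for x in xs]


def quantize_int8(values):
    """Quantization: map floats to integer codes in [-127, 127] plus a scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]


def prune_by_magnitude(weights, fraction):
    """Pruning: zero out the given fraction of smallest-magnitude weights."""
    n_prune = int(len(weights) * fraction)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]
```

Fusion leaves the output unchanged, while quantization and pruning trade a small approximation error for smaller and cheaper computation, which is why their effect on accuracy has to be checked on the actual model.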

Deep learning optimization depends greatly on the specific hardware-software coupling, and specific compilers work best on specific couplings. So it is difficult to know a priori the performance of the many deep learning compilers on the market for each specific use case and testing is necessary. This is exactly what nebullvm does, saving programmers countless hours.
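The general idea of "try several options on this machine and keep the fastest" can be sketched in a few lines. This is a simplified illustration under stated assumptions, not nebullvm's actual implementation: `pick_fastest` is a hypothetical helper, and the three backend functions are stand-ins for versions of the same model produced by different compilers.

```python
import time


def pick_fastest(candidates, sample_input, runs=10):
    """Time each candidate and return (best_name, seconds_per_call_by_name)."""
    timings = {}
    for name, fn in candidates.items():
        try:
            fn(sample_input)  # warm-up call; also checks the candidate runs
            start = time.perf_counter()
            for _ in range(runs):
                fn(sample_input)
            timings[name] = (time.perf_counter() - start) / runs
        except Exception:
            # A compiler may not support this model or machine: skip it.
            continue
    if not timings:
        raise RuntimeError("no candidate ran successfully")
    best = min(timings, key=timings.get)
    return best, timings


# Hypothetical stand-ins for the same model compiled in different ways.
def slow_backend(x):
    time.sleep(0.01)
    return x * 2


def fast_backend(x):
    return x * 2


def broken_backend(x):
    raise NotImplementedError("unsupported on this hardware")
```

Calling `pick_fastest({"slow": slow_backend, "fast": fast_backend, "broken": broken_backend}, 3)` times the two working candidates on the current machine and silently drops the one that fails. A fuller version would also compare each candidate's output against the original model to confirm that accuracy is preserved.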

Acknowledgements

The team behind nebullvm is a group of former MIT, ETH, and EPFL folks who teamed up and launched Nebuly. They developed this open-source library along with a lot of other great technologies to make AI more efficient. You can find out more about Nebuly on its website, LinkedIn, Twitter or Instagram.

Many kudos go to Diego Fiori, the library's main contributor. Diego is a curious person and always thirsty for knowledge, which he likes to consume as much as good food and wine. He is a versatile programmer, very jealous of his code, and never lets his code look less than magnificent. In short, Diego is the CTO of Nebuly.

Huge thanks also go to the open-source community that has developed the numerous DL compilers that make it possible to accelerate AI models.

And finally, many thanks to all those who are supporting the nebullvm open-source community, finding bugs and fixing them, and enabling the creation of this state-of-the-art, super-powerful AI accelerator.

References

Papers and articles about deep learning compilers.

[1] A friendly introduction to machine learning compilers and optimizers by Chip Huyen
[2] The Deep Learning Compiler: A Comprehensive Survey by Mingzhen Li et al.
[3] Principled optimization of dynamic neural networks by Jared Roesch
[4] A Survey of Quantization Methods for Efficient Neural Network Inference by Amir Gholami et al.
[5] Quantization for Neural Networks by Lei Mao
[6] Neural Network Pruning 101 by Hugo Tessier

Documentation of deep learning compilers used by nebullvm.