The Revolutionary Potential of 1-Bit Large Language Models (LLMs)

Written by thebojda | Published 2024/03/03

TL;DR: 1-bit LLMs are a potential way to build neural networks that are much more efficient, more biologically plausible, and better suited to specialized hardware. Investigating how effectively 1-bit networks can be trained with gradient-free methods could be a very interesting research topic.

Anyone interested in the evolution of Artificial Intelligence technology knows that today's solutions are all about Large Language Models (LLMs) and transformers. In a nutshell, LLMs are neural networks that can predict the next token based on the input tokens. Typically, these tokens are words (this isn't entirely accurate, but it's easier to conceptualize this way), and the network's output is also a word. This is how ChatGPT works. You input a question, and the network generates a word. Then, the question and the word together become the network input, generating another word, and so on, until a complete answer is formed.
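To make this loop concrete, here is a minimal sketch of greedy next-token generation using the Hugging Face transformers library. GPT-2 serves purely as a small stand-in model, and the prompt and token count are arbitrary choices:

```python
# A minimal autoregressive generation loop: predict the next token, append it,
# and feed the longer sequence back into the model, over and over.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(10):                                 # generate 10 tokens greedily
        logits = model(input_ids).logits                # (batch, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1)    # most likely next token
        input_ids = torch.cat([input_ids, next_token.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```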

However, tokens can be more than just words. Advanced language models like GPT-4 or Gemini are now multimodal, meaning their input can include images as well as words. Just as a sentence can be broken down into words, an image can be divided into small patches, which the same transformer architecture can then process. For instance, a multimodal network can be asked to describe what is in an image or to generate the code for the user interface shown in a screenshot.
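As a rough sketch of how an image becomes a token sequence, the snippet below cuts a picture into 16x16 patches and projects each one into an embedding vector, in the spirit of vision transformers. The patch size and embedding dimension are illustrative assumptions, not values taken from any particular model:

```python
# Splitting an image into patch "tokens": each 16x16 patch is flattened and
# projected into the same kind of embedding space that text tokens live in.
import torch

image = torch.randn(3, 224, 224)                 # a dummy C x H x W image
patch = 16
# unfold height and width into a 14x14 grid of 3x16x16 patches
patches = image.unfold(1, patch, patch).unfold(2, patch, patch)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch * patch)  # (196, 768)

embed = torch.nn.Linear(3 * patch * patch, 768)  # learned patch embedding
tokens = embed(patches)                          # 196 image "tokens" of dimension 768
print(tokens.shape)                              # torch.Size([196, 768])
```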

This architecture is even more general than that. DeepMind's Gato system is a prime example: a single transformer network can answer questions, play video games, and control a robot. Robots have even been controlled using ChatGPT. Since an LLM works with tokens, and almost any task can be tokenized, an LLM can in principle serve as a universal solution for any task.


One of the most hyped recent tech news stories was about the company Groq developing an ASIC (Application-Specific Integrated Circuit) that can run LLMs much more efficiently and with less energy than traditional GPUs. This clearly shows that the LLM architecture has become so fundamental that it is now worthwhile to build specialized hardware for it.

Also recently, a paper titled "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits" appeared. Quantization of neural networks is a common method for reducing model size and computational demand. The idea is to perform training on large GPU clusters using floating-point numbers and then convert the finished network's weights into a less precise format that the processors of user devices can handle more efficiently. For example, training is carried out with 16- or 32-bit floating-point numbers, which are then converted into 8- or 4-bit fixed-point numbers for fast client-side inference. This way, the model can work well even on mobile or IoT devices. An extreme form of quantization converts the weights into 1-bit numbers. This can be a fully binary conversion or, as the paper suggests, use the three values {-1, 0, 1} (hence 1.58 bits, since log2(3) ≈ 1.58). One might think such extreme quantization would render the network completely unusable, but in reality the opposite is true: these 1-bit networks perform exceptionally well.
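To make the idea tangible, here is a simplified sketch of ternary quantization: the weights are scaled by their mean absolute value, rounded, and clipped to {-1, 0, 1}. This mirrors the spirit of the approach described in the paper, but it is not the paper's exact recipe:

```python
# Simplified ternary quantization of a weight matrix.
import numpy as np

def quantize_ternary(w: np.ndarray):
    scale = np.mean(np.abs(w)) + 1e-8           # per-tensor scaling factor
    q = np.clip(np.round(w / scale), -1, 1)     # weights restricted to {-1, 0, 1}
    return q.astype(np.int8), scale             # keep the scale for dequantization

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_ternary(w)
print(q)                                        # entries are -1, 0, or 1
print(np.log2(3))                               # ~1.58 bits of information per weight
```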

Why is this such a big deal?

If these three values are sufficient to represent the weights, then multiplication, currently the most frequently used operation in neural networks, is no longer necessary: multiplying by 1 or -1 amounts to adding or subtracting the input, and multiplying by 0 means skipping it. GPU clusters are used for neural networks precisely because GPUs perform multiplications very efficiently. Without the need for multiplications, there is no need for GPUs; the models can run efficiently even on CPUs, or specialized hardware (ASICs) can be built that runs these 1-bit networks, possibly even in an analog way.
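A toy illustration of this multiplication-free arithmetic: with ternary weights, a matrix-vector product reduces to adding, subtracting, or skipping input values. The explicit loops below are written out only to make that visible:

```python
# With weights in {-1, 0, 1}, "multiplication" degenerates into add, subtract,
# or skip, so a matrix-vector product needs no multiplier at all.
import numpy as np

def ternary_matvec(W, x):
    out = np.zeros(W.shape[0], dtype=x.dtype)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            if W[i, j] == 1:
                out[i] += x[j]       # +1 -> add the input
            elif W[i, j] == -1:
                out[i] -= x[j]       # -1 -> subtract the input
            # 0 -> skip entirely
    return out

W = np.array([[1, 0, -1], [-1, 1, 0]], dtype=np.int8)
x = np.array([0.5, 2.0, -1.0], dtype=np.float32)
print(ternary_matvec(W, x))          # same result as W @ x, without multiplications
print(W @ x)                         # cross-check
```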


Currently, quantization is a post-training operation, so 1-bit networks do not accelerate the training process itself. They are still very useful, because training is a one-time cost, while the trained network is then run countless times; consequently, inference accounts for significantly more energy consumption than training. Still, it would be even better if we could benefit from this technology during training as well.
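A back-of-the-envelope illustration with completely made-up numbers shows why inference tends to dominate: training is paid for once, while inference is paid for on every query:

```python
# Hypothetical numbers, only to illustrate the amortization argument.
train_cost_kwh = 1_000_000           # one-time training energy (made up)
inference_cost_kwh = 0.001           # energy per query (made up)
queries = 10_000_000_000             # lifetime query count (made up)

total_inference = inference_cost_kwh * queries
print(total_inference / train_cost_kwh)   # here inference uses 10x the training energy
```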


Since standard gradient-based training does not work directly on 1-bit or binarized weights, gradient-free techniques become relevant (check nevergrad and PyGAD), such as genetic algorithms and other gradient-free optimizers. Although backpropagation is in most cases much more efficient than gradient-free solutions, 1-bit networks can be evaluated much faster than their floating-point counterparts. So it might be that with backpropagation we find the optimal network 10 times faster using floating-point numbers than with, say, genetic algorithms; but if the 1-bit network runs 20 times faster, then training with genetic algorithms is still twice as fast overall. Investigating how effectively 1-bit networks can be trained with gradient-free methods could be a very interesting research topic.
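As one possible illustration, here is a self-contained toy genetic algorithm, written in plain NumPy rather than nevergrad or PyGAD so as not to depend on any particular library API. It evolves ternary weights in {-1, 0, 1} for a tiny two-layer network on the XOR problem; the task, the network size, and all hyperparameters are arbitrary assumptions:

```python
# A toy genetic algorithm evolving ternary weights for a tiny network on XOR.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([0, 1, 1, 0], dtype=np.float32)
N_GENES = 2 * 4 + 4 * 1                      # 2x4 hidden weights + 4x1 output weights

def forward(genes, x):
    W1 = genes[:8].reshape(2, 4)             # ternary hidden layer
    W2 = genes[8:].reshape(4, 1)             # ternary output layer
    h = np.maximum(x @ W1, 0)                # ReLU
    return (h @ W2).ravel()

def fitness(genes):
    return -np.mean((forward(genes, X) - y) ** 2)   # negative MSE, higher is better

pop = rng.integers(-1, 2, size=(64, N_GENES))       # random ternary population
for generation in range(300):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-16:]]         # keep the 16 fittest (elitism)
    moms = parents[rng.integers(0, 16, size=48)]
    dads = parents[rng.integers(0, 16, size=48)]
    cross = rng.random((48, N_GENES)) < 0.5         # uniform crossover
    children = np.where(cross, moms, dads)
    mutate = rng.random(children.shape) < 0.05      # 5% point mutations
    children = np.where(mutate, rng.integers(-1, 2, size=children.shape), children)
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print(forward(best, X))                             # should approach [0, 1, 1, 0]
```

Every candidate network here is ternary throughout the search, so the expensive inner loop (evaluating the population) only ever runs the cheap, multiplication-free networks.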


Another reason this topic is so fascinating is that these networks more closely resemble the neural networks found in biological brains; that is, they are more biologically plausible. I therefore believe that by choosing a good gradient-free training algorithm and applying these 1-bit networks, we can build systems that are much more similar to the human brain. Moreover, this opens the door to technological solutions beyond ASICs that were previously not feasible, such as analog, light-based, or even biologically based processors.

It's possible that this direction might turn out to be a dead-end in the long run, but for now, its revolutionary potential is apparent, making it a very promising research avenue for anyone involved in the field of artificial intelligence.

