From Crappy Autocomplete to ChatGPT: The Evolution of Language Models

Written by rosdem
Tech Story Tags: ai | chatgpt | transformers | large-language-models | technology | technology-trends | artificial-intelligence | hackernoon-top-story

TL;DR: Language models aren’t new. We’ve had them in our phones for years, doing autocomplete. They are trained to determine the next probable word or symbol, but they can also be fine-tuned for other tasks such as language translation and question-answering.

Despite the current hype, language models aren’t new. We’ve had them in our phones for years, doing autocomplete. And, while they may save us a couple of seconds on spelling, no one would ever call them "smart" or “sentient.”

Technically, all language models are just probability distributions over tokens. They are trained to determine the next probable word or symbol (whichever unit the text was tokenized into), given the previous ones. But they can also be fine-tuned for other tasks, such as language translation and question-answering.

What is language generation?

Language generation is the process of giving an algorithm a starting word so it can generate the next one based on the probabilities it learned from the training data, and then continuously feeding it its own output. For example, if the model sees "I,” we expect it to produce "am," then "fine," and so forth.
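To make that loop concrete, here is a minimal Python sketch that uses a tiny hand-written probability table in place of a trained model (the table, the words, and the function name are purely illustrative):

```python
import random

# Toy "model": for each word, a probability distribution over the next word.
# A real language model learns these probabilities from huge amounts of text.
NEXT_WORD_PROBS = {
    "I":     {"am": 0.6, "was": 0.4},
    "am":    {"fine": 0.7, "tired": 0.3},
    "was":   {"fine": 0.5, "late": 0.5},
    "fine":  {".": 1.0},
    "tired": {".": 1.0},
    "late":  {".": 1.0},
}

def generate(seed: str, max_tokens: int = 10) -> str:
    """Repeatedly pick the next word and feed the output back into the model."""
    sentence, current = [seed], seed
    for _ in range(max_tokens):
        probs = NEXT_WORD_PROBS.get(current)
        if probs is None:          # no known continuation for this word
            break
        words, weights = zip(*probs.items())
        current = random.choices(words, weights=weights)[0]
        sentence.append(current)
        if current == ".":
            break
    return " ".join(sentence)

print(generate("I"))  # e.g. "I am fine ."
```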

Its ability to create meaningful sentences depends on the size of its context window, i.e., how far back it can look. Older, basic models, such as those found in our phones, can only look one or two words back, which is why they are myopic and forget the beginning of a sentence by the time they reach the middle.

From RNNs to Transformers

Prior to transformers, researchers used Recurrent Neural Nets (RNNs) to fix the short memory problem. Without going into too much detail, we can say that their trick was to produce a hidden state vector summarizing all the tokens seen so far in the input sentence and to update it with each new token introduced.
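Roughly, the recurrence looks like the NumPy sketch below; the weights are randomly initialized stand-ins for trained parameters, and the sizes are arbitrary:

```python
import numpy as np

hidden_size, embed_size = 8, 4
rng = np.random.default_rng(0)

# Random matrices stand in for the trained weights of a real RNN.
W_h = rng.normal(size=(hidden_size, hidden_size))  # hidden state -> hidden state
W_x = rng.normal(size=(hidden_size, embed_size))   # token embedding -> hidden state

def rnn_encode(token_embeddings):
    """Fold a sequence of token embeddings into a single hidden state vector."""
    h = np.zeros(hidden_size)
    for x in token_embeddings:            # strictly one token at a time: no parallelism
        h = np.tanh(W_h @ h + W_x @ x)    # update the state with the new token
    return h                              # a compressed summary of the whole sentence

sentence = [rng.normal(size=embed_size) for _ in range(5)]  # five fake token embeddings
print(rnn_encode(sentence))
```

Because each step depends on the previous hidden state, the loop cannot be parallelized, which is exactly the limitation discussed below.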

Although the idea was definitely clever, the hidden state always ended up being heavily biased toward the most recent inputs. Therefore, like basic algorithms, RNNs still tended to forget the start of the sentence, although not nearly as quickly.

Later, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks were introduced. Unlike vanilla RNNs, they had built-in mechanisms (gates) that helped retain the memory of relevant inputs, even if they were far from the output being produced. But these networks were still sequential in nature, and their architectures were overly complex. They were inefficient and ruled out parallel computation, so there was no chance of running them on many processors simultaneously to get lightning-fast performance.

In 2017, transformers were first described in the Google paper "Attention Is All You Need." As opposed to LSTMs and GRUs, they had the ability to actively choose the segments that were relevant for processing at a given stage and reference them when making a prediction. They were faster, more efficient, and had a simpler architecture based on the principle of attention.

It’s funny that if you read the work now, it sounds like a run-of-the-mill paper on machine translation, of which there were plenty at the time. The authors probably didn’t realize they might have invented one of the most important architectures in AI history.

Attention

In the context of machine learning, attention refers to vectors assigned to each token that contain information about its position in the sequence and its importance relative to the other input elements. The model can use them when making predictions without the need for serial processing. Let’s break it down a bit so it becomes clearer.

Before transformers, the traditional approach to sequence-to-sequence processing, such as neural language translation, was to encode all inputs into a single hidden state using an RNN and then decode the target sequence using another RNN. All that mattered on the encoding end was the final state.

In 2014, Bahdanau et al. proposed the brilliant idea of making all hidden states available to the decoder network and allowing it to determine which of them were the most important for generating the current output. The network paid attention to the relevant parts and ignored the rest.

Four years later, Google's paper was released. This time, the authors suggested ditching RNNs completely and just using attention for both the encoding and decoding phases. To do so, they had to make certain modifications to the original attention mechanism, which led to the development of self-attention.

Self-attention

It’s probably easiest to think of self-attention as a communication mechanism between nodes in a single sequence. The way it works is that all input tokens are assigned three vectors - Query (Q), Key (K), and Value (V) - which represent different aspects of their initial embedding.

  • The query vectors (Q) indicate what the input is looking for. Think of them as the phrases you type into the YouTube search bar.

  • The key vectors (K) serve as identifiers for the input, helping it locate matches for its query. These are sort of like the YouTube search results with relevant titles.

  • The value vectors (V) represent the actual content of each token and allow the model to determine the importance of a relevant node in relation to the query and generate output. These can be thought of as thumbnails and video descriptions that help you decide which video from the search results to click on.

Note: In self-attention, all Qs, Ks, and Vs come from the same sequence, whereas in cross-attention the queries come from one sequence while the keys and values come from another.

The self-attention formula looks like this: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. And here’s the procedure in a nutshell (a minimal code sketch follows the list):

  1. Three linear transformations are applied to queries, keys, and values to create corresponding matrices - Q, K, V.
  2. Dot products of Qs and Ks are computed; they tell us how well all the queries match with all the keys.
  3. The resulting matrix is divided by the square root of the dimension of the keys, d_k. This is a downscaling step needed to keep the gradients stable (multiplying the values directly might otherwise have an exploding effect).
  4. The softmax function is applied to the scaled scores, and thus the attention weights are obtained. This computation gives us values between 0 and 1 that sum to 1 across each row.
  5. The attention weights for each input are multiplied by their value vectors and that’s how the outputs are computed.
  6. The outputs are passed through one more linear transformation, which helps to incorporate the data from self-attention into the rest of the model.
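Here is a minimal NumPy sketch of steps 1 through 5 for a single attention head; the toy sizes and random weights are illustrative, and the final projection from step 6 as well as masking are omitted for brevity:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)       # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one head (steps 1-5 above)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v            # 1. linear projections -> Q, K, V
    scores = Q @ K.T                               # 2. how well each query matches each key
    scores /= np.sqrt(K.shape[-1])                 # 3. scale by sqrt(d_k)
    weights = softmax(scores, axis=-1)             # 4. attention weights, each row sums to 1
    return weights @ V                             # 5. weighted sum of the value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8                    # toy dimensions
X = rng.normal(size=(seq_len, d_model))            # embeddings of a 4-token sequence
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)      # (4, 8): one output vector per token
```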

The GPT Family

Transformers were initially introduced as a simpler alternative to RNNs for processing sequences, but over the past five years, they've been applied to various areas of AI research, including computer vision, and have frequently surpassed state-of-the-art models.

In 2018, though, we didn’t yet know how powerful they could be if made big (with hundreds of millions, and later billions, of parameters), given ample computing power, and trained on vast, diverse, and unlabeled text corpora from the web.

The first glimpse of their capabilities came with the Generative Pre-trained Transformer (GPT) developed by OpenAI, which had 117 million parameters and was pre-trained on unlabeled data. It outperformed discriminatively trained models on 9 out of 12 NLP tasks, even though those models relied on architectures crafted specifically for each task, while GPT mostly relied on its generative pre-training.

Then came GPT-2 models (the largest one having 1.5 billion parameters), which were followed by many other transformers. And in 2020, OpenAI finally released GPT-3; its biggest version had 175 billion parameters, and its architecture was mostly the same as in GPT-2.

It seems that OpenAI’s goal was to determine how high a level of performance they could squeeze out of their model by just making it bigger and providing it with more text and power. The results were astonishing.

Note: 175 billion parameters is considered quite small by today’s standards.

GPT-3 is capable of generating texts in various styles and formats, such as novels, poems, manuals, scripts, news articles, press releases, image captions, song lyrics, emails, dialogue responses, etc. It can write code, summarize, rephrase, simplify, categorize any information, and much more. It would literally take a whole other article just to list all its capabilities. And yet, at its core, this beast is still a simple autocomplete system.

ChatGPT

OK, so we have an incredibly powerful language model. Can we just use it as a chatbot? No.

GPT-3 and its analogs are still tools for sequence completion and nothing more. Without proper direction, they will ramble on about the topic they pick up from your question and make up phony articles, news, novels, etc., that might appear to be fluent, cohesive, and grammatically impeccable, but they will rarely be useful.

To create a chatbot that is actually helpful, OpenAI extensively fine-tuned GPT-3 or GPT-3.5, its updated version (we don’t know for certain which one). While many details of this process have not yet been revealed, we do know that the bot was trained in almost the same way as InstructGPT, its sibling model. And we’ve also noticed that the latter is in many ways similar to Sparrow, DeepMind’s yet-to-be-launched version of a ‘smart dialogue agent,’ described in this paper, which came out a bit later.

So, knowing that these models are built on essentially the same transformer architecture, we can read the blog post from OpenAI, compare it to the Sparrow paper, and then make some educated guesses about what goes on under the hood of ChatGPT.

The fine-tuning process described in that post had three stages:

  1. Accumulating data that demonstrates to the AI how an assistant should act. This dataset consists of texts where questions are followed by precise and useful answers. Luckily, large pre-trained language models are very sample-efficient, which means the process probably didn’t take that long.

  2. Giving the model a tryout: having it generate multiple answers to the same questions and having humans rate each answer, while at the same time training a reward model to recognize desirable responses (see the toy sketch after this list).

  3. Using OpenAI's Proximal Policy Optimization (PPO) to fine-tune the model itself, ensuring that ChatGPT's replies receive a high score from the reward model.
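To illustrate stage 2, here is a toy sketch of the pairwise-ranking objective commonly used to train such reward models (the numbers and the function are made up for illustration, not taken from OpenAI): the reward model is pushed to score the human-preferred answer higher than the rejected one.

```python
import numpy as np

def pairwise_ranking_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Small when the preferred answer scores higher than the rejected one, large otherwise."""
    return -np.log(1.0 / (1.0 + np.exp(-(reward_preferred - reward_rejected))))

# Hypothetical reward-model scores for two answers to the same question.
print(pairwise_ranking_loss(reward_preferred=2.1, reward_rejected=0.3))  # ~0.15 (good ordering)
print(pairwise_ranking_loss(reward_preferred=0.3, reward_rejected=2.1))  # ~1.95 (bad ordering)
```

In stage 3, the scores from a model trained this way become the reward signal that PPO tries to maximize.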

The Sparrow paper describes a similar method, but with a few additional steps. Like all of DeepMind's dialogue agents, Sparrow is conditioned on specific hand-crafted prompts: inputs that programmers always prepend to the conversation and that users never see. ChatGPT is likely also guided by these kinds of ‘invisible’ prompts.

To make it an effective assistant, Sparrow was asked questions and it generated responses that were then evaluated by humans based on the general principles of usefulness and ethical rules that were put forth by DeepMind (such as politeness and accuracy). There was also an adversarial type of training where humans actively tried to make Sparrow fail. Then, two neural net classifiers were trained for its evaluation; one that scores the replies in terms of helpfulness and one that determines how far the answers deviate from DeepMind's rules.

ChatGPT now knows not to generate offensive content, but it did occasionally produce insensitive replies after the release; we think OpenAI might have added another model specifically designed to keep harmful text from getting through. But of course, we can’t know for sure yet, and ChatGPT itself stays cagey about this.

Unlike ChatGPT, Sparrow will also be able to provide evidence to support what it says, as it will cite sources and access Google search. In order to make the model able to do that, the researchers updated its initial prompt and introduced two more personas into it: the search query and the search result.

Note: The same principle has likely been applied in Bard, ChatGPT’s competitor recently announced by Google.

After this training with the two classifiers, which used the ELI5 dataset and the answers from Sparrow's prior iterations, the model is able to generate multiple accurate and well-researched answers to each question. The answer shown to the user is always the one that scores highest with the usefulness classifier and lowest with the rule-deviation classifier.
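A minimal sketch of that final selection step, with made-up candidate answers and scores (real systems use neural classifiers, and combining the two scores as a simple difference is just one illustrative choice):

```python
# Hypothetical candidate answers with scores from the two classifiers.
candidates = [
    {"text": "Answer A", "usefulness": 0.82, "rule_violation": 0.30},
    {"text": "Answer B", "usefulness": 0.91, "rule_violation": 0.05},
    {"text": "Answer C", "usefulness": 0.95, "rule_violation": 0.60},
]

# Prefer the answer that is most useful while deviating least from the rules.
best = max(candidates, key=lambda c: c["usefulness"] - c["rule_violation"])
print(best["text"])  # "Answer B"
```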

So, what’s next?

Bard, Google’s chatbot based on the LaMDA language model, was announced on February 6th. It is already generating buzz, but no specific details about its training have emerged yet. The beta version of Sparrow is also expected to be released sometime in 2023. It remains to be seen if either of these bots will become nearly as popular as ChatGPT. Both have unique features that give them the potential to become the new number-one chatbot, but we also don’t think that OpenAI will stop updating and improving its superstar chat assistant.

Perhaps we will see ChatGPT with new and even better features soon. It’s impossible to predict which company will end up on top in terms of market dominance. But whoever wins the competition will push the boundaries of what’s perceived to be achievable with AI technology even further, and it will surely be exciting.


Written by rosdem | I'm a Ukrainian tech content writer and researcher specializing in Blockchain, Crypto, and AI.
Published by HackerNoon