Different Roles for Different Models: LLMs and Reinforcement Learning

Written by Eye on AI | Published 2023/05/09
Tech Story Tags: llms | generative-ai | gpt-4 | chatgpt | reinforcement-learning | artificial-intelligence | ai-models | machine-learning

TL;DR: Large language models are hampered by hallucination, the generation of incorrect or nonsensical text that is semantically or syntactically plausible. This is a serious problem that limits their usefulness, especially for automating complex, error-prone tasks at scale. Some experts believe that reinforcement learning with human feedback can eliminate hallucinations. Others argue that a more fundamental flaw in large language models is at work.

The rise of large language models like ChatGPT, with their ability to generate highly fluent and accurate text, has been remarkable. However, these models are hampered by hallucination, the generation of incorrect or nonsensical text that is semantically or syntactically plausible. This is a serious problem that limits their usefulness, especially for automating complex, error-prone tasks at scale.

While some experts like Ilya Sutskever of OpenAI believe that reinforcement learning with human feedback can eliminate hallucinations, others like Yann LeCun of Meta and Geoff Hinton of Google argue that a more fundamental flaw in large language models is at work. LeCun and Hinton both believe that large language models lack non-linguistic knowledge, which is critical for understanding the underlying reality that language describes.

This is an interesting debate, and these experts highlight gaps in the capabilities of seemingly all-conquering large language models where other AI approaches can be more effective.

Software developers spend around 35% of their time testing software, so automating this task not only saves time but also increases productivity. Large language models can suggest code, and much has been made of their usefulness in unit testing.

However, because LLMs trade accuracy for generalization, the best they can do is suggest code to developers, who must then check that the code actually works.

GitHub's Copilot, which is powered by OpenAI's Codex, a descendant of GPT-3, does not explicitly generate unit tests, but it can suggest code snippets for testing. For example, if a developer is writing a method that takes a list as input, Copilot may suggest code snippets that test the method with an empty list, a list with one item, and a list with multiple items. These suggested snippets can be used as a starting point for writing more comprehensive unit tests.
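To make this concrete, here is a minimal JUnit 5 sketch of the kind of edge-case tests described above. The sumOf method and the assertions are hypothetical, made up for illustration; they are not output from Copilot.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.List;
import org.junit.jupiter.api.Test;

// Hypothetical example: a simple method that takes a list, plus the kind of
// edge-case tests an assistant might suggest for it (empty, single, multiple).
class SumOfTest {

    // The method under test: sums the integers in a list.
    static int sumOf(List<Integer> numbers) {
        return numbers.stream().mapToInt(Integer::intValue).sum();
    }

    @Test
    void emptyListReturnsZero() {
        assertEquals(0, sumOf(List.of()));
    }

    @Test
    void singleItemListReturnsThatItem() {
        assertEquals(7, sumOf(List.of(7)));
    }

    @Test
    void multipleItemsReturnTheirSum() {
        assertEquals(12, sumOf(List.of(3, 4, 5)));
    }
}
```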

So, while Copilot can be helpful in generating some initial test cases, it is not a replacement for a comprehensive testing strategy.

Microsoft Research, the University of Pennsylvania, and the University of California, San Diego have proposed TiCoder (Test-driven Interactive Coder), which leverages user feedback to generate code based on natural language inputs consistent with user intent. It uses natural language processing and machine learning algorithms to assist developers in generating unit tests.

When a developer writes code, TiCoder asks the coder a series of questions to refine its understanding of the coder’s intent. It then provides suggestions and autocomplete options based on the code's context, syntax, and language. It generates test cases based on the code being written, suggesting assertions and testing various scenarios.
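As a hypothetical illustration of that interaction style (not TiCoder's actual interface), the sketch below shows a developer's stated intent, an assumed largestOf method being written, and a clarifying test the tool might propose, with the question-and-answer exchange represented as comments.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.Collections;
import java.util.List;
import org.junit.jupiter.api.Test;

// Hypothetical sketch of an intent-clarifying exchange; the method, the test,
// and the dialogue are invented for illustration.
class LargestOfTest {

    // Developer intent (natural language): "return the largest element of a list"
    static int largestOf(List<Integer> numbers) {
        return Collections.max(numbers);
    }

    // Tool's clarifying question: "Should largestOf(List.of(2, 9, 4)) return 9?"
    // Developer answers yes, so the suggested test is kept in the suite.
    @Test
    void returnsLargestElement() {
        assertEquals(9, largestOf(List.of(2, 9, 4)));
    }
}
```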

Both Copilot and TiCoder, as well as other LLM-based tools, may speed up the writing of unit tests, but they are fundamentally AI assistants to human coders who check their work, rather than productive AI-based coders in their own right. So is there a better way?

Geoff Hinton from Google points out that we learn to play basketball by throwing the ball so it goes through the hoop. We don’t learn the skill by reading about basketball – we learn by trial and error. And that’s the core idea behind Reinforcement Learning, an area of AI that has demonstrated impressive performance in tasks like game-playing. Reinforcement learning systems can be far more accurate and cost-effective than large language models because they learn by doing.

Diffblue Cover, for example, writes executable unit tests without human intervention, making it possible to automate complex, error-prone tasks at scale.

The product uses reinforcement learning to search the space of all possible test methods, write the test code automatically for each method, and select the best test among those written. The reward function is based on various criteria, including the coverage a test achieves and its aesthetics, such as a coding style that looks as if a human had written it. The tool creates tests for each method in an average of one second, and delivers the best test for a unit of code within one or two minutes at most.
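A minimal sketch of that search-and-score idea is below. It is not Diffblue Cover's implementation: the candidate tests, coverage numbers, style scores, and reward weights are all invented stand-ins, and a real system would execute generated tests against the code under test and measure coverage with instrumentation.

```java
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only: score candidate tests with a reward that combines
// coverage and style, then keep the highest-scoring one.
public class TestSearchSketch {

    // A candidate test and the signals used to score it (values assumed here).
    record CandidateTest(String sourceCode, double coverage, double styleScore) {}

    // Reward combining coverage with "aesthetics"; the weights are made up.
    static double reward(CandidateTest t) {
        return 0.8 * t.coverage() + 0.2 * t.styleScore();
    }

    // One step of the search: pick the best candidate from a batch.
    static CandidateTest selectBest(List<CandidateTest> candidates) {
        return candidates.stream()
                .max(Comparator.comparingDouble(TestSearchSketch::reward))
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<CandidateTest> candidates = List.of(
                new CandidateTest("assertEquals(0, sumOf(List.of()));", 0.40, 0.9),
                new CandidateTest("assertEquals(12, sumOf(List.of(3, 4, 5)));", 0.75, 0.8));
        System.out.println("Best candidate: " + selectBest(candidates).sourceCode());
    }
}
```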

Diffblue Cover is more similar to AlphaGo, DeepMind's automatic system for playing the game Go, than to Copilot or TiCoder. AlphaGo identifies areas of a huge search space where there are potential moves to win the game, and then uses reinforcement learning on these areas to select which move to make next. Diffblue Cover does the same with unit test methods: it comes up with potential tests, evaluates them to find the best one, and repeats this operation until it has built a full test suite.

If the goal is to automate the writing of 10,000 unit tests for a program no single person understands, reinforcement learning is the only real solution. Large deep-learning models just can’t compete – not least because there’s no way for humans to effectively supervise them and correct their code at that scale, and making models larger and more complicated doesn’t fix that.

While large language models like ChatGPT have wowed the world with their fluency and depth of knowledge, for precise tasks like unit testing, reinforcement learning is a more accurate and cost-effective solution.

Mathew Lodge is CEO of Diffblue, an Oxford, UK-based AI startup.


Written by Eye on AI | Craig S. Smith is a former correspondent for The New York Times and hosts the podcast “Eye on A.I.”
Published by HackerNoon on 2023/05/09