Does AI-generated Code Need to Be Tested?

Written by schrammel | Published 2023/09/13
Tech Story Tags: ai-generated-code | testing-ai-generated-code | is-github-copilot-reliable | how-effective-is-copilot | how-effective-is-ai-for-coding | ai-code-validation | qa-testing-ai-code | ai-code-verification

TL;DR: AI-powered tools for writing code cannot guarantee anything regarding the functional correctness and security of the code they suggest. However, there are AI tools, such as Diffblue Cover, that can support you in assessing whether that code is fit for purpose.

AI-powered tools for writing code, such as GitHub Copilot [1], are increasingly popular in software development. These tools promise to boost productivity, but some also claim that they democratize programming by allowing non-programmers to write applications.

But how do we actually know whether the code written by an AI tool is fit for purpose?

In the following, you are going to learn what “fit for purpose” even means and which tools you can use to assess it. We will see that AI-powered tools for writing code cannot guarantee anything regarding the functional correctness and security of the code they suggest. However, we will also see that there are AI tools, such as Diffblue Cover, that can support you in answering the above question.

Stepping back in time

It’s worthwhile to begin by stepping back in history a bit, because this question is not unfamiliar at all: we’ve always had to ask ourselves how we actually know that code written by humans is fit for purpose. People have been scratching their heads for decades to find solutions for this fundamental problem in software engineering.

From the earliest days of programmable computer systems, engineers had nasty surprises when programs did not do what they intended. Back then, trial-and-error cycles to get programs right were very expensive: low-level code needed to be handcrafted and punched into cards. The main countermeasure to unnecessary cycles was code review [2]. Code review means that an expert reads and tries to understand what the code does in order to flag mistakes and suggest improvements - a successful technique that continues to be widely practiced today. However, the effectiveness of reviews decreases dramatically with the size and complexity of the programs, while the effort to conduct them thoroughly grows.

Soon the question came up of how to tell, in a more rigorous way, whether a program is doing what it is supposed to be doing. The challenge is how to express what the program is supposed to do: somehow it needs to be communicated to the machine what the user actually wants. This is a highly challenging problem that is still waiting to be fully solved today.

The problem is common to product engineering across all disciplines and is broken up into two steps, which are usually formulated as the questions: (1) Have we built the right product? (2) Have we built the product right?

Validation and verification

Assessing whether the right product has been built is known as validation [3]. Validation is ultimately in the hands of the user, who assesses whether the product fulfills its intended purpose. Verification, in contrast, starts from a requirements specification, which serves as the means of communication between the builders of the product and the user or customer. Specifications are supposed to be understood by both sides: the user (a domain expert) and the engineer (who may not be a domain expert). Verification amounts to assessing whether the implementation of the product conforms to the requirements specification. Validation is clearly the harder problem, because what constitutes the right product is highly subjective and therefore hard to automate.

The good news is that the verification part can be fully automated - in principle, up to computability and complexity-theoretic limits. That is, it can be mathematically proven that an implementation satisfies the specification. This discipline is known as formal methods or formal verification (e.g. [4]) and relies on logic-based formalisms to write specifications and on automated reasoning to perform the proofs.
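To make this concrete, here is a minimal sketch of the specification/implementation pairing, assuming plain Java assertions as the specification notation (the clamp function and class name are hypothetical, not taken from the references). A formal verification tool would attempt to prove the postcondition for all inputs, whereas a test only checks individual ones.

```java
// Minimal sketch: the assertions act as the specification, the method body as
// the implementation. Run with assertions enabled (java -ea) to check them.
public class Clamp {

    // Implementation: restrict a value to the range [lo, hi].
    static int clamp(int value, int lo, int hi) {
        assert lo <= hi : "precondition: lo <= hi";             // specification
        int result = Math.max(lo, Math.min(value, hi));
        assert lo <= result && result <= hi : "postcondition";  // specification
        return result;
    }

    public static void main(String[] args) {
        System.out.println(clamp(42, 0, 10)); // prints 10
    }
}
```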

As promising as this sounds, the main problem is again: who writes the specifications? It requires someone who is both a domain expert and an expert in writing formal specifications - such people are hard to find and very expensive. Even when you have found such a person and have succeeded in verifying your implementation, you are still left with the validation part of the problem, i.e. whether the specification actually describes the right product from the user’s point of view.

It has been commonly observed that specifications are often “more wrong” than the implementation, because it is extremely difficult to get formal specifications right (e.g. [5]). Another problem is scaling automated reasoning to large systems. Hence, in practice, formal verification has found its place in comparatively small, complex, and critical (safety, security, financial) software, ranging from embedded control and cryptography to operating system kernels and smart contracts.

A different perspective

In the 1970s another idea emerged that approached the problem from a different angle, called N-version programming [6]. The basic idea is roughly: since it’s so hard to get programs and even their specifications right, let multiple independent teams implement the system and then vote on the output. The underlying assumption is that different teams make different mistakes; they may also have different interpretations of the requirements specification. So, on the whole, the result is expected to be “more correct” than a single implementation in terms of verification and maybe even validation. However, it turned out that the assumption is wrong: even independent teams make the same mistakes [7].
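To illustrate the mechanics of the idea, here is a minimal sketch in Java, assuming three hypothetical, independently written versions of the same small function and a simple majority vote over their results (the function and class names are made up for illustration).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntBinaryOperator;

// Sketch of N-version programming: run N independent implementations of the
// same function and return the majority result.
public class NVersionVote {

    // Three "independent" implementations of the integer midpoint (illustrative only).
    static final IntBinaryOperator v1 = (a, b) -> (a + b) / 2;                // may overflow
    static final IntBinaryOperator v2 = (a, b) -> a + (b - a) / 2;            // overflow-safe
    static final IntBinaryOperator v3 = (a, b) -> (int) (((long) a + b) / 2); // widen to long

    static int vote(int a, int b) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (IntBinaryOperator v : Arrays.asList(v1, v2, v3)) {
            counts.merge(v.applyAsInt(a, b), 1, Integer::sum);
        }
        // Return the most frequent result; a tie would signal disagreement.
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get()
                .getKey();
    }

    public static void main(String[] args) {
        // v1 overflows for extreme inputs, but the majority still agrees.
        System.out.println(vote(Integer.MAX_VALUE, Integer.MAX_VALUE - 2));
    }
}
```

The catch, as the study cited above found, is that the implementations are rarely as independent as the voting scheme assumes.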

A legacy of this approach is, however, that verification can be viewed as 2-version programming: requirements specification and implementation are two sides of the same coin. They describe the same system in different ways using different formalisms and points of view. Also, they are often written by different people or teams. Neither is the specification in any sense “more correct” than the implementation, nor vice versa. This way of thinking can guide us to a realistic view on what can be achieved in practice.

So, why do we even bother having both specifications and implementations? The benefit comes from 2-version programming: comparing two descriptions of the same system allows us to gain confidence where they agree with each other and find bugs where they disagree, enabling us to reflect on both descriptions and ultimately arrive at “more correct” descriptions.

Testing is a verification technique

Now, some readers may interject: “We don’t have specifications, so we can’t do that. How does this concern us?” You may not have a solid requirements specification, but you probably do test your software. True, we haven’t talked about testing yet. What is testing actually doing in the context of our discussion?

Testing is a verification technique. Tests check your implementation, and the assertions in your tests are - you guessed it - the specification. How often has it happened to you that the bug was not in the implementation, but in the tests? This is not surprising, since testing is just 2-version programming. So, having a good testing practice, with end-to-end tests for the high-level requirements and thorough unit testing for the low-level ones, indeed increases confidence in delivering the right product right. And what about model-based engineering? Yes, just 2-version programming with the same properties. This is nothing bad - it’s just the nature of the beast.
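In code, the 2-version view is easy to see. Below is a minimal sketch, assuming JUnit 5 and the hypothetical Clamp class from the earlier example: the assertions are a second, independently written description of what clamp is supposed to do.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

// The test is the "second version": each assertion states an expected
// input-output pair, i.e. a piece of the specification.
class ClampTest {

    @Test
    void valueAboveRangeIsClampedToUpperBound() {
        assertEquals(10, Clamp.clamp(42, 0, 10));
    }

    @Test
    void valueInsideRangeIsUnchanged() {
        assertEquals(7, Clamp.clamp(7, 0, 10));
    }
}
```

If such a test fails, the bug may be in clamp or in the test itself - exactly the 2-version situation described above.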

So do we need to test AI-written code?

Assessing whether an application written by an AI tool is actually fit for purpose is as difficult as doing the same for human-written code. The hard part is answering the validation question: the person doing this assessment must be an expert in the application domain.

A tool for automatically writing tests, such as Diffblue Cover [8], helps you in this process, as the tests it creates give you a behavioral, input-output view of the code. Diffblue Cover doesn’t make guesses - it bluntly tells you what the code is actually doing and thus helps you assess the validation and verification questions. The tests also serve as a baseline for regression testing when you change the code going forward - independently of whether it was written by a human or a machine.
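As a purely illustrative sketch (not actual Diffblue Cover output, and using a made-up method and class), this is what a behavior-capturing regression test looks like: it pins down what the code currently returns, and it is then up to a domain expert to judge whether that observed behavior is the intended one.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

// Behavior-capturing regression tests: the assertions record the current
// input-output behavior of the (hypothetical) method under test.
class DiscountRegressionTest {

    // Hypothetical method under test: 10% discount for orders above 10,000 cents.
    static long priceAfterDiscountCents(long cents) {
        return cents > 10_000 ? cents - cents / 10 : cents;
    }

    @Test
    void orderOfExactlyTenThousandCentsGetsNoDiscount() {
        // Pins down the boundary behavior. Whether "no discount at exactly
        // 10,000" is right is a validation question for the domain expert.
        assertEquals(10_000, priceAfterDiscountCents(10_000));
    }

    @Test
    void orderAboveThresholdIsDiscounted() {
        assertEquals(18_000, priceAfterDiscountCents(20_000));
    }
}
```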

The chances are quite good that non-programmers who are experts in their application domain will benefit from AI tools for programming applications. However, they need to be aware that implementations written by tools such as GitHub Copilot are not correct by construction - functional correctness and security properties are not ingrained in the underlying models they use. Even if the models are trained on “correct” (we now know how much that means) and secure code, these properties are not preserved by training and inference.

AI to rescue AI

AI-powered tools make it easy to program applications that do something. However, the application code must be expected to suffer from similar issues as human-written code. Thus, you need to apply the same rigor in testing and QA that is used for human-written code. Confidence in these processes can be increased by combining them with an AI tool for automatically writing tests, such as Diffblue Cover, which can guarantee that the tests it produces describe the actual behavior of the code.

--

[1] GitHub Copilot, https://github.com/features/copilot

[2] M. Fagan, “Design and Code Inspections to Reduce Errors in Program Development”, IBM Syst. J. 15(3), pp. 182-211, 1976.

[3] H. Pham, “Software Reliability”, Wiley, 1999.

[4] E. Clarke, O. Grumberg, D. Kroening, D. Peled and H. Veith, “Model Checking”, 2nd Ed., MIT Press, 2018.

[5] O. Legunsen, W. Hassan, X. Xu, G. Rosu and D. Marinov, “How good are the specs? A study of the bug-finding effectiveness of existing Java API specifications”, Proceedings of ASE, pp. 602-613, 2016.

[6] A. Avizienis and L. Chen, “On the Implementation of N-version Programming for Software Fault-Tolerance During Execution”, Proceedings of COMPSAC, pp. 149-155, 1977.

[7] J. Knight and N. Leveson, “An experimental evaluation of the assumption of independence in multiversion programming”, IEEE Trans. Softw. Eng. 12(1), pp. 96-109, 1986.

[8] Diffblue Cover, https://www.diffblue.com

