Optimizing Genetic Improvement with GPT 3.5 Turbo

Written by mutation | Published 2024/02/27
Tech Story Tags: large-language-models | genetic-improvement | genetic-improvement-with-llms | java-programming | gpt3.5-for-genetic-improvement | llm-applications | llm-research-papers | genetic-programming

TL;DR: This section outlines the experimental setup employing GPT 3.5 Turbo and the Gin toolbox for Genetic Improvement experiments. Learn about methodologies for random and local search, prompts for LLM requests, and implementation strategies for optimizing software evolution.

This paper is available on arxiv under CC 4.0 license.

Authors:

(1) Alexander E.I. Brownlee, University of Stirling, UK;

(2) James Callan, University College London, UK;

(3) Karine Even-Mendoza, King’s College London, UK;

(4) Alina Geiger, Johannes Gutenberg University Mainz, Germany;

(5) Justyna Petke, University College London, UK;

(6) Federica Sarro, University College London, UK;

(7) Carol Hanna, University College London, UK;

(8) Dominik Sobania, Johannes Gutenberg University Mainz, Germany.

Table of Links

Abstract & Introduction

Experimental Setup

Results

Conclusions and Future Work

Acknowledgements & References

2 Experimental Setup

To analyze the use of LLMs as a mutation operator in GI, we used OpenAI's GPT 3.5 Turbo model and the GI toolbox Gin [3]. We experimented with two types of search implemented within Gin: random search and local search. Requests to the LLM through the OpenAI API were made via the LangChain4j library, with a temperature of 0.7. The target project for improvement in our experiments was the popular JCodec [7] project, which is written in Java. 'Hot' methods to be targeted by the edits were identified using Gin's profiler tool by repeating the profiling 20 times and taking the union of the resulting sets.
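As an illustration, a request with these settings might look roughly like the Java sketch below, using LangChain4j's OpenAI integration. The class and method names (OpenAiChatModel, generate) reflect one version of the library and may differ in other releases, and the prompt string is only a placeholder, not one of the prompts used in the experiments.

import dev.langchain4j.model.openai.OpenAiChatModel;

public class LlmMutationRequest {
    public static void main(String[] args) {
        // Build a GPT 3.5 Turbo client with the temperature used in the experiments (0.7).
        // Exact builder/method names depend on the LangChain4j version.
        OpenAiChatModel model = OpenAiChatModel.builder()
                .apiKey(System.getenv("OPENAI_API_KEY"))
                .modelName("gpt-3.5-turbo")
                .temperature(0.7)
                .build();

        // Placeholder prompt; the actual simple/medium/detailed prompts are described below.
        String prompt = "Give me 5 different Java implementations of this method body: ...";

        // Send the request and read back the raw text of the response.
        String response = model.generate(prompt);
        System.out.println(response);
    }
}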

For the random sampling experiments, we set up the runs with statement-level edits (copy/delete/replace/swap from [14] and insert break/continue/return from [4]) and LLM edits, generating 1000 of each type at random. A timeout of 10000 milliseconds was used for each unit test to catch infinite loops introduced by edits; exceeding the timeout counts as a test failure. For local search, experiments were set up similarly. There were 10 repeat runs (one for each of the top 10 hot methods), but the runs were limited to 100 evaluations each, resulting in 1000 evaluations in total, matching the random search. In practice this meant 99 edits per run, as the first evaluation was used to time the original unpatched code.
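The timeout handling can be pictured with a small Java sketch along these lines. Gin implements this internally, so the ExecutorService wrapper and the Callable test hook below are illustrative assumptions rather than Gin's actual API.

import java.util.concurrent.*;

public class TimedTestRunner {
    private static final long TIMEOUT_MS = 10_000; // 10000 ms per unit test, as in the experiments

    // Runs a single unit test and treats exceeding the timeout as a failure
    // (e.g. an infinite loop introduced by an edit).
    static boolean runWithTimeout(Callable<Boolean> unitTest) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Boolean> result = executor.submit(unitTest);
        try {
            return result.get(TIMEOUT_MS, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            result.cancel(true); // stop the runaway test
            return false;        // a timeout or error counts as a test failure
        } finally {
            executor.shutdownNow();
        }
    }
}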

We experimented with three different prompts for sending requests to the LLM for both types of search: a simple prompt, a medium prompt, and a detailed prompt. With all three prompts, our implementation requests five different variations of the code at hand. The simple prompt only requests the code without any additional information. The medium prompt adds more information about the code and our requirements, as shown in Figure 1. Specifically, we provide the LLM with the programming language used, the project that the code belongs to, and formatting instructions. The detailed prompt extends the medium prompt with an example of a useful change. This example was taken from results obtained by Brownlee et al. [4]. The patch is a successful instance of the insert edit applied to the JCodec project (i.e., an edit that compiled, passed the unit tests, and offered a speedup over the original code). We use the same example for all the detailed prompt requests in our experiments because LLMs are capable of inductive reasoning: the user presents specific information, and the LLM can use that input to generate more general statements, a capability further improved in GPT-4 [8].



Fig. 1. The medium prompt for LLM requests, with line breaks added for readability.
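To make the three levels concrete, a prompt builder along the following lines could assemble the simple, medium, and detailed variants. The wording shown here is an assumption modelled on the description above (and on Figure 1), not the verbatim prompts from the paper, and the example strings passed to the detailed variant are placeholders.

public class PromptBuilder {

    // Simple prompt: only the code, no additional information.
    static String simple(String code) {
        return "Give me 5 different versions of this code:\n" + code;
    }

    // Medium prompt: adds the programming language, the project name,
    // and formatting instructions (cf. Figure 1).
    static String medium(String code) {
        return "The following Java code is part of the JCodec project.\n"
             + "Give me 5 different versions of it. Wrap each version in curly braces\n"
             + "and do not add any explanation.\n" + code;
    }

    // Detailed prompt: the medium prompt plus one example of a useful change
    // (a successful insert edit from Brownlee et al. [4]).
    static String detailed(String code, String exampleBefore, String exampleAfter) {
        return medium(code)
             + "\nHere is an example of a useful change:\nBefore:\n" + exampleBefore
             + "\nAfter:\n" + exampleAfter;
    }
}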

LLM edits are applied by selecting a block statement at random in a target 'hot' method. This block's content fills the <code> placeholder in the prompt. The first code block in the LLM response is then identified. Gin uses JavaParser (https://javaparser.org) internally to represent target source files, so we attempt to parse the LLM suggestion with JavaParser and, if parsing succeeds, replace the original block with the LLM suggestion.
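A minimal sketch of that last step might look like the following. It uses the real JavaParser API (StaticJavaParser.parseBlock and Node.replace), but the surrounding class and variable names are hypothetical and not taken from Gin.

import com.github.javaparser.ParseProblemException;
import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.stmt.BlockStmt;

public class LlmEditApplier {

    // Tries to parse the code block extracted from the LLM response and,
    // if it is valid Java, swaps it in for the original block statement.
    // Assumes the extracted suggestion is already wrapped in curly braces.
    static boolean applySuggestion(BlockStmt originalBlock, String llmSuggestion) {
        try {
            BlockStmt replacement = StaticJavaParser.parseBlock(llmSuggestion);
            return originalBlock.replace(replacement); // replace the node in the AST
        } catch (ParseProblemException e) {
            return false; // unparseable suggestions are discarded
        }
    }
}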


