Automating the Correctness Assessment of AI-generated Code for Security Contexts: Experimental Setup

Written by escholar | Published 2024/02/14
Tech Story Tags: ai-code | ai-code-security | code-correctness | chatgpt | cybersecurity | software-vulnerabilities | ai-code-generators | ai-programming


This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Domenico Cotroneo, University of Naples Federico II, Naples, Italy;

(2) Alessio Foggia, University of Naples Federico II, Naples, Italy;

(3) Cristina Improta, University of Naples Federico II, Naples, Italy;

(4) Pietro Liguori, University of Naples Federico II, Naples, Italy;

(5) Roberto Natella, University of Naples Federico II, Naples, Italy.

Table of Links

Abstract & Introduction

Motivating Example

Proposed Method

Experimental Setup

Experimental Results

Related Work

Conclusion & References

4. Experimental Setup

4.1. AI-code Generation

To perform code generation and assess the tool on the AI-generated code, we adopted four state-of-the-art NMT models.

■ Seq2Seq is a model that maps an input sequence to an output sequence. Following the encoder-decoder architecture with attention mechanism [21], we use a bi-directional LSTM encoder to transform an embedded intent sequence into a vector of hidden states of equal length. We implement the Seq2Seq model using xnmt [22]. We use the Adam optimizer [23] with β1 = 0.9 and β2 = 0.999, while the learning rate α is set to 0.001. We set all the remaining hyper-parameters to a basic configuration: layer dimension = 512, layers = 1, epochs = 200, beam size = 5.

■ CodeBERT [24] is a large multi-layer bidirectional Transformer architecture [25] pre-trained on millions of lines of code across six different programming languages. Our implementation uses an encoder-decoder framework in which the encoder is initialized with the pre-trained CodeBERT weights and the decoder is a Transformer decoder composed of 6 stacked layers. The encoder follows the RoBERTa architecture [26], with 12 attention heads, a hidden layer dimension of 768, 12 encoder layers, and a position embedding size of 514. We set the learning rate α = 0.00005, batch size = 32, and beam size = 10.

■ CodeT5+ [27] is a family of Transformer models pre-trained with a diverse set of pretraining tasks, including causal language modeling, contrastive learning, and text-code matching, to learn rich representations from both unimodal code data and bimodal code-text data. We use the 220M-parameter variant, which is trained from scratch following the T5 architecture [28]. It has an encoder-decoder architecture with 12 decoder layers, each with 12 attention heads and a hidden layer dimension of 768, and a position embedding size of 512. We set the learning rate α = 0.00005, batch size = 16, and beam size = 10.

■ PLBart [29] is a multilingual encoder-decoder (sequence-to-sequence) model primarily intended for code-to-text, text-to-code, and code-to-code tasks. The model is pre-trained on a large collection of Java and Python functions and natural language descriptions collected from GitHub and Stack Overflow. We use the PLBart-large architecture with 12 encoder layers and 12 decoder layers, each with 16 attention heads. We set the learning rate α = 0.00005, batch size = 16, and beam size = 10.
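As a concrete illustration of how these pre-trained generators are instantiated, the following is a minimal sketch, assuming the Hugging Face transformers library and the public Salesforce/codet5p-220m checkpoint (CodeBERT and PLBart are loaded analogously from their own checkpoints, while Seq2Seq is implemented with xnmt); fine-tuning on the shellcode dataset is assumed to have been performed beforehand.

```python
# Minimal sketch (not the exact training code of the paper): loading a
# pre-trained code generator and producing a prediction with beam search.
# Assumes the Hugging Face "transformers" library and the public
# "Salesforce/codet5p-220m" checkpoint; fine-tuning on the shellcode
# dataset is assumed to have already been performed.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "Salesforce/codet5p-220m"   # CodeBERT / PLBart are loaded analogously
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Example NL intent (illustrative).
intent = "move the contents of the stack pointer into the ebx register"
inputs = tokenizer(intent, return_tensors="pt")

# Beam search with the beam size used for CodeT5+ in the paper (beam size = 10).
outputs = model.generate(**inputs, num_beams=10, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```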

We followed the best practices in the field of code generation by supporting the NMT models with data processing operations. These steps are performed both before translation (pre-processing), to train the NMT model and prepare the input data, and after translation (post-processing), to improve the quality and readability of the generated code.

Our pre-processing operations start with stopword filtering, i.e., we remove a set of custom-compiled words (e.g., the, each, onto) from the intents to keep only the data relevant for machine translation. Next, we use a tokenizer to break the intents into chunks of text containing space-separated words (i.e., the tokens). To improve the performance of the machine translation [30, 31, 2], we standardize the intents (i.e., we reduce the randomness of the NL descriptions) by using a named entity tagger, which returns a dictionary of standardizable tokens, such as specific values, label names, and parameters, extracted through regular expressions. We replace the selected tokens in every intent with “var#”, where # denotes a number from 0 to |l|, and |l| is the number of tokens to standardize. Finally, the tokens are represented as real-valued vectors using word embeddings.
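The sketch below illustrates these pre-processing steps on a single intent; the stopword list, the regular expressions, and the helper names are simplified examples chosen for illustration, not the exact ones used in our pipeline.

```python
# Simplified pre-processing sketch: stopword filtering, tokenization, and
# standardization of an NL intent. The stopword list and the regular
# expressions are illustrative, not the exact ones used in the pipeline.
import re

STOPWORDS = {"the", "each", "onto", "a", "an", "of"}   # custom-compiled stopwords (example subset)
VALUE_RE = re.compile(r"0x[0-9a-fA-F]+|\b\d+\b")       # example: hexadecimal/decimal values
REGISTER_RE = re.compile(r"\b(eax|ebx|ecx|edx|esi|edi|esp|ebp)\b")  # example: register names

def preprocess(intent):
    # 1) Stopword filtering and tokenization into space-separated tokens.
    tokens = [t for t in intent.lower().split() if t not in STOPWORDS]
    text = " ".join(tokens)

    # 2) Standardization: replace standardizable tokens with "var#" and
    #    keep a dictionary for the later de-standardization step.
    dictionary = {}
    def standardize(match):
        key = "var" + str(len(dictionary))
        dictionary[key] = match.group(0)
        return key
    text = VALUE_RE.sub(standardize, text)
    text = REGISTER_RE.sub(standardize, text)

    # 3) The standardized tokens are then mapped to word embeddings by the model.
    return text.split(), dictionary

tokens, dictionary = preprocess("push the value 0x68732f2f onto the eax register")
print(tokens)       # ['push', 'value', 'var0', 'var1', 'register']
print(dictionary)   # {'var0': '0x68732f2f', 'var1': 'eax'}
```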

The pre-processed data is used to feed the NMT model. Once the model is trained, we perform code generation from the NL intents: when the model takes new intents as input, it generates the related code snippets based on its knowledge (the model’s prediction). As with the intents, the code snippets predicted by the models are also processed (post-processing) to improve the quality and readability of the code. Finally, the dictionary of standardizable tokens is used in the de-standardization process to replace all the “var#” placeholders with the corresponding values, names, and parameters.
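De-standardization is then a simple inverse mapping. The sketch below, again simplified for illustration, restores the original values in a predicted snippet using the dictionary produced during standardization.

```python
# Simplified post-processing sketch: de-standardization of a predicted snippet.
# The dictionary comes from the standardization step; the prediction is an
# illustrative example.
def destandardize(prediction, dictionary):
    # Replace every "var#" placeholder with its original value, name, or parameter.
    for key, value in dictionary.items():
        prediction = prediction.replace(key, value)
    return prediction

dictionary = {"var0": "0x68732f2f", "var1": "eax"}
print(destandardize("push var0\nmov var1, esp", dictionary))
# push 0x68732f2f
# mov eax, esp
```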

4.2. Dataset

To feed the models for the generation of security-oriented code, we extended the publicly available Shellcode_IA32 dataset [32, 33] for automatically generating shellcodes from NL descriptions. A shellcode is a list of machine code instructions to be loaded into a vulnerable application at runtime. The traditional way to develop shellcodes is to write them in assembly language and use an assembler to turn them into opcodes (operation codes, i.e., machine language instructions in binary format, to be decoded and executed by the CPU) [34, 35]. Common objectives of shellcodes include spawning a system shell, killing or restarting other processes, causing a denial-of-service (e.g., a fork bomb), leaking secret data, etc.

The dataset consists of instructions in assembly language for IA-32, collected from publicly available security exploits [36, 37] and manually annotated with detailed English descriptions. In total, it contains 3,200 unique pairs of assembly code snippets/English intents. We further enriched the dataset with additional samples of shellcodes collected from publicly available security exploits, reaching 5,900 unique pairs of assembly code snippets/English intents. To the best of our knowledge, the resulting dataset is the largest collection of shellcodes in assembly available to date.

Our dataset also includes 1,374 intents (∼23% of the dataset) that generate multiple lines of assembly code, separated by the newline character \n. These multi-line snippets contain many different assembly instructions (e.g., whole functions). For example, copying the ASCII string “/bin//sh” into a register is a typical operation to spawn a shell, which requires three distinct assembly instructions: pushing the hexadecimal values of the words “/bin” and “//sh” onto the stack before moving the contents of the stack pointer register into the destination register. Further examples of multi-line snippets include conditional jumps, tricks to zero-out the registers without generating null bytes, etc. Table 1 shows two further examples of multi-line snippets with their natural language intents.
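To make the format of such samples concrete, the following is a hypothetical dataset entry in the spirit of the “/bin//sh” example above (the exact intent phrasing in the dataset may differ): the NL intent is paired with three assembly instructions separated by the newline character.

```python
# Hypothetical multi-line dataset entry (illustrative; the exact intent phrasing
# in the dataset may differ). The snippet pushes the words "//sh" and "/bin"
# onto the stack and moves the stack pointer into the destination register ebx.
sample = {
    "intent": "copy the ASCII string /bin//sh into the ebx register",
    "snippet": "push 0x68732f2f\npush 0x6e69622f\nmov ebx, esp",
}
print(sample["snippet"])
```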

Table 2 summarizes the statistics of the dataset used in this work, including the number of unique NL intents and assembly code snippets, the number of unique tokens, and the average number of tokens per snippet and intent. The dataset is publicly available on GitHub [2].

4.3. Baseline Assessment Solutions

As a baseline for the evaluation, we used the following output similarity metrics, which are widely used to assess the performance of AI generators in many code generation tasks [15], including the generation of assembly code for security contexts [38, 1, 3, 2, 33]:

• Compilation Accuracy (CA). It indicates whether each code snippet produced by the model is compilable according to the syntax rules of the target language. The CA value is either 1, when the snippet’s syntax is correct, or 0 otherwise. To compute the compilation accuracy, we used the Netwide Assembler (NASM) [18].

• Bilingual Evaluation Understudy (BLEU) score [39]. It measures the degree of n-gram overlap between the string of each code snippet produced by the model and the reference. This metric also applies a brevity penalty to penalize predictions shorter than the references. The BLEU value ranges between 0 and 1, with higher scores corresponding to a better quality of the prediction. Similar to previous studies, we use the BLEU-4 score (i.e., we set n = 4). We computed the BLEU score using the bleu_score module contained in the open-source Python suite Natural Language Toolkit (NLTK) [40].

• SacreBLEU [41]. This is an alternative implementation of the BLEU score that differs from the traditional one in its tokenization techniques. We used the implementation available on Hugging Face [42].

• Exact Match accuracy (EM). It indicates whether each code snippet produced by the model perfectly matches the reference. The EM value is 1 when there is an exact match, 0 otherwise. To compute the exact match, we used a simple Python string comparison.

• Edit Distance (ED). It measures the edit distance between two strings, i.e., the minimum number of single-character operations required to make each code snippet produced by the model equal to the reference. The ED value ranges between 0 and 1, with higher scores corresponding to smaller distances. For the edit distance, we adopted the Python library pylcs [43].
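The sketch below shows how these output similarity metrics can be computed for a single prediction/reference pair, assuming NASM and the nltk, evaluate, and pylcs Python packages are installed; the edit-distance normalization and the NASM invocation details are simplifying assumptions rather than the exact implementation used in this work.

```python
# Illustrative computation of the output similarity metrics for one
# prediction/reference pair. Assumes nasm, nltk, evaluate, and pylcs are
# installed; the edit-distance normalization and the NASM invocation are
# simplifying assumptions, not the paper's exact implementation.
import os
import subprocess
import tempfile

import evaluate
import pylcs
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

reference  = "mov ebx, esp"
prediction = "mov ebx, esp"

# Compilation Accuracy (CA): 1 if NASM accepts the snippet, 0 otherwise.
with tempfile.NamedTemporaryFile("w", suffix=".asm", delete=False) as f:
    f.write("BITS 32\n" + prediction + "\n")
    asm_path = f.name
result = subprocess.run(["nasm", "-f", "elf32", asm_path, "-o", os.devnull],
                        capture_output=True)
ca = 1 if result.returncode == 0 else 0

# BLEU-4: degree of 4-gram overlap between prediction and reference (NLTK).
bleu4 = sentence_bleu([reference.split()], prediction.split(),
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)

# SacreBLEU: alternative BLEU implementation with its own tokenization
# (this library reports the score on a 0-100 scale).
sacrebleu = evaluate.load("sacrebleu")
sb = sacrebleu.compute(predictions=[prediction], references=[[reference]])["score"]

# Exact Match (EM): plain string comparison.
em = 1 if prediction == reference else 0

# Edit Distance (ED), normalized so that higher means closer to the reference.
dist = pylcs.edit_distance(prediction, reference)
ed = 1 - dist / max(len(prediction), len(reference))

print(ca, round(bleu4, 4), round(sb, 2), em, round(ed, 4))
```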

As a further baseline for the comparison, we adopted ChatGPT [44], the AI-powered language model developed by OpenAI. For every snippet generated by the models, we asked ChatGPT to assess the code by assigning a value of 1 when it is correct and 0 otherwise. We performed two different evaluations:

• ChatGPT-NL: ChatGPT evaluates whether the code generated by the models correctly translates the natural language intent into assembly code, similar to what a human evaluator does during manual code review;

• ChatGPT-GT: ChatGPT evaluates whether the code generated by the models is semantically equivalent to the ground truth used as a reference for the evaluation, similar to the assessment performed by the output similarity metrics.
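The sketch below shows how these two queries could be issued programmatically, assuming the OpenAI Python client; the model name and the prompt wording are illustrative assumptions, not the exact prompts used in our evaluation.

```python
# Illustrative sketch of the ChatGPT-based assessment (ChatGPT-NL and
# ChatGPT-GT). The model name and the prompt wording are assumptions for
# illustration, not the exact prompts used in the paper. Requires the
# "openai" package and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def _ask(prompt):
    # Ask the model for a binary 1/0 judgment and parse the first character.
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",   # assumed model name
        messages=[{"role": "user", "content": prompt}],
    )
    return 1 if reply.choices[0].message.content.strip().startswith("1") else 0

def chatgpt_nl(intent, generated_code):
    """ChatGPT-NL: does the generated assembly implement the NL intent?"""
    return _ask(
        "You are reviewing IA-32 assembly code. Answer only 1 or 0.\n"
        f"Intent: {intent}\n"
        f"Generated code: {generated_code}\n"
        "Answer 1 if the code correctly implements the intent, 0 otherwise."
    )

def chatgpt_gt(ground_truth, generated_code):
    """ChatGPT-GT: is the generated assembly semantically equivalent to the reference?"""
    return _ask(
        "You are comparing two IA-32 assembly snippets. Answer only 1 or 0.\n"
        f"Reference: {ground_truth}\n"
        f"Generated code: {generated_code}\n"
        "Answer 1 if they are semantically equivalent, 0 otherwise."
    )
```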

4.4. Human Evaluation

To ensure a robust and thorough assessment of both ACCA and the baseline approaches in evaluating AI-generated code, we compared them against human evaluation, which serves as the ground truth for our analysis.

In the human evaluation, a code prediction was deemed successful, and given a score of 1, if it accurately translated the NL description into assembly language and adhered to the rules of the assembly programming language; any deviation resulted in a score of 0.

To strengthen the integrity of our evaluation and minimize the potential for human error, three authors, well-versed in assembly for the IA-32 architecture and in shellcode development, independently inspected each code snippet generated by the models. Any discrepancies were attributed to human oversight and promptly rectified, resulting in unanimous consensus (100% agreement) across all cases of the human evaluation.


[2] https://github.com/dessertlab/Shellcode_IA32

