ChatGPT vs Copilot vs Programmers: Who's Coming Out on Top?

Written by anywhichway | Published 2023/01/26
Tech Story Tags: artificial-intelligence | chatgpt | ai | github-copilot | openai | code-generator | can-ai-code | hackernoon-top-story | hackernoon-es | hackernoon-hi | hackernoon-zh | hackernoon-vi | hackernoon-fr | hackernoon-pt | hackernoon-ja

TL;DR: This article compares the suggestions of OpenAI ChatGPT and Codex and of Microsoft Copilot to hand-written code, with a focus on algorithms that are key to processing large volumes of data efficiently. It finds that the AIs produce either average or flawed code, and it provides suggestions for general improvement.

With the increased availability of no-code/low-code environments and the advent of AI applied to programming, there have been a number of articles posted saying that the end is near for programmers either because end users will create their own apps or AI will write apps from descriptions. Perhaps … but not yet.

In this article, I explore the current state of AI-assisted coding by comparing the suggestions of OpenAI ChatGPT and Codex to Microsoft Copilot to hand-written code. The AIs produce either average or flawed code.

Since so much focus today is placed on processing large volumes of data (including the large language models used for ChatGPT and Copilot), I focus on algorithms that are key to processing data efficiently. Hopefully, the AIs “understand” their own nature better than code in general!

These algorithms are:

  • intersection

  • cross-product

  • memoizing

I provided ChatGPT with directions for the form:

Show me <some function description>

I provided Codex with input:

<some function description>

I provided Copilot with comments in a WebStorm IDE

// <some function description>

The function descriptions were the same for each.

Testing Methodology

I used Jasmine as the unit test harness, supplemented by code similar to that found in Benchtest, to capture speed as well as memory and CPU utilization.

Intersection

Show me a JavaScript function that takes a variable number of arrays and returns their intersection.

OpenAI ChatGPT

function intersection(...arrays) {
    if (arrays.length < 2) {
        throw new Error("At least 2 arrays are required!");
    }

    let result = new Set(arrays.shift());
    for (let arr of arrays) {
        result = new Set([...result].filter(x => arr.includes(x)));
    }
    return [...result];
}

OpenAI Codex

function intersection(arrays) {
  var result = [];
  var args = Array.prototype.slice.call(arguments);
  var arrays = args.slice(0, args.length - 1);
  var callback = args[args.length - 1];
  var minLength = Math.min.apply(null, arrays.map(function(arr) { return arr.length; }));
  for (var i = 0; i < minLength; i++) {
    var item = arrays[0][i];
    var isInAll = arrays.every(function(arr) { return arr.indexOf(item) !== -1; });
    if (isInAll) {
      result.push(item);
    }
  }
  return callback(result);
}

Microsoft Copilot

const intersection = (...arrays) => {
    const [first, ...rest] = arrays;
    return first.filter(x => rest.every(y => y.includes(x)));
}

ChatGPT assumed that providing fewer than two arrays is not valid. I can imagine situations in which providing one argument might be valid.

By using a Set, ChatGPT ensures that no duplicates exist in results. The intersection is supposed to be a set operation, and duplicates should be removed. The Copilot code will return an array that may contain duplicates.
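This is easy to demonstrate. A minimal check against Copilot's version (reproduced here, renamed, so the snippet is self-contained):

```javascript
// Copilot's version from above, renamed for clarity
const copilotIntersection = (...arrays) => {
    const [first, ...rest] = arrays;
    return first.filter(x => rest.every(y => y.includes(x)));
};

// Duplicates in the first array survive into the result,
// so the output is not a proper set
console.log(copilotIntersection([1, 1, 2, 3], [1, 1, 3])); // [1, 1, 3]
```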

Neither ChatGPT nor Copilot sorts the original arguments ascending by length. This is a trivial optimization that could make a big difference!

If any argument has a zero length, there is no intersection at all. Regardless, sorting shortens the loops, because the maximum possible intersection is only as large as the shortest array argument.
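A hand-written sketch of both optimizations (mine, not AI output): sort the arguments ascending by length, short-circuit on an empty argument, and filter only the shortest array, using Sets for O(1) membership tests and de-duplication:

```javascript
function intersection(...arrays) {
    // Sorting ascending by length means we test the fewest candidates
    const sorted = [...arrays].sort((a, b) => a.length - b.length);
    // An empty argument (or no arguments) means an empty intersection
    if (sorted.length === 0 || sorted[0].length === 0) return [];
    const [shortest, ...rest] = sorted;
    // Sets give O(1) membership tests versus O(n) Array.includes
    const sets = rest.map(arr => new Set(arr));
    // De-duplicate the shortest array, then keep items present everywhere
    return [...new Set(shortest)].filter(item => sets.every(s => s.has(item)));
}
```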

Codex provided complex code that, based on my knowledge, looked quite efficient. However, its use of arguments is obscure and involves a somewhat archaic programming style.

Note that a callback must be provided as the last argument to return the intersection to the calling code, yet this argument does not appear in the function signature.

I then asked ChatGPT and Copilot to give me an efficient function.

Show me an efficient JavaScript function that takes a variable number of arrays and returns their intersection.

ChatGPT

function intersection(...arrays) {
    if (arrays.length < 2) {
        throw new Error("At least 2 arrays are required!");
    }
    let result = new Set();

    for (let item of arrays[0]) {
        let include = true;
        for (let i = 1; i < arrays.length; i++) {
            if(!arrays[i].includes(item)){
                include = false;
                break;
            }
        }
        if(include){
            result.add(item);
        }
    }
    return [...result];
}

Copilot generated the same code as the first request.

ChatGPT also added commentary saying the function would not work as expected for objects, although arguably it would. So, I asked for a function that would support objects:

Show me an efficient JavaScript function that takes a variable number of arrays containing objects or primitives and returns their intersection.

function intersection(...arrays) {
    if (arrays.length < 2) {
        throw new Error("At least 2 arrays are required!");
    }
    let result = new Set();

    for (let item of arrays[0]) {
        let include = true;
        for (let i = 1; i < arrays.length; i++) {
            if(!arrays[i].find(x => JSON.stringify(x) === JSON.stringify(item))){
                include = false;
                break;
            }
        }
        if(include){
            result.add(item);
        }
    }
    return [...result];
}

This function takes an arbitrary number of arrays as arguments and first checks if at least 2 arrays are passed. Then it uses a nested loop. The outer loop iterates over the first array and the inner loop iterates over the remaining arrays. It compares each item of the first array with each item of the remaining arrays using the Array.prototype.find() method and JSON.stringify() to check if the objects are the same. If an item is found in all arrays, it is added to the result set.

ChatGPT stringifies objects to compare them. There are a number of issues with this:

  1. Two objects that have the same string representation might legitimately be separate objects.

  2. Even if stringifying is valid, different property orders will result in different strings, and the objects might be representing the same entity.

  3. JSON.stringify followed by a string comparison is a slow operation.
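Point 2 is easy to reproduce:

```javascript
const a = { x: 1, y: 2 };
const b = { y: 2, x: 1 }; // arguably the same entity, different property order

// JSON.stringify preserves property insertion order, so the strings differ
console.log(JSON.stringify(a)); // '{"x":1,"y":2}'
console.log(JSON.stringify(b)); // '{"y":2,"x":1}'
console.log(JSON.stringify(a) === JSON.stringify(b)); // false
```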

And now, the real test! Below, are benchmark results for speed and memory using code generated by ChatGPT, Codex, Copilot, and the two fastest intersection libraries available via NPM, fastArrayIntersect and intersector.

The benchmark intersected 3 identical arrays of 10,000 integers each and a final array of 5,000 integers with a sample size of 100. A test was also run to confirm the functions returned sets when duplicate entries existed in the source arrays.

| Source | Duplicates | Heap Used | Ops Sec | User CPU | System CPU |
| --- | --- | --- | --- | --- | --- |
| ChatGPT | No | 39768 | 6.65 | 152170 | 780 |
| Codex | No | 5475888 | 16.00 | 69070 | 160 |
| Copilot | Yes | 30768 | 4.16 | 345190 | 940 |
| Intersector | No | 37304 | 73.02 | 21020 | 160 |
| Fast Array Intersection | No | 42064 | 20.33 | 18990 | 630 |

Copilot uses the least heap but is also the slowest and most CPU-intensive.

ChatGPT is the least CPU-intensive but is also slow.

Fast Array Intersect is 3 to 5 times as fast as Copilot and ChatGPT and 20% faster than Codex while being less CPU-intensive.

Intersector is 15x faster than ChatGPT and Copilot, 4x faster than Codex, and more than 3x faster than Fast Array Intersect while also having the lowest CPU utilization.

In summary:

  1. Neither ChatGPT nor Copilot generates efficient code.

  2. ChatGPT makes assumptions that may not be valid.

  3. Copilot generates code that does not produce a set when its arguments contain duplicate values.

  4. Codex produces reasonably performant, although obscure, code.

  5. The programmers win.

Both ChatGPT and Copilot could be improved by applying their “intelligence” to asking the user a few clarifying questions, e.g., are you looking for speed or memory efficiency, do you want to require at least two arguments?

Cross Product

To be clear on the objective of this section, I was looking for all possible combinations of items in arrays, not an arithmetic cross-product of vectors.

Show me a JavaScript function for Cartesian cross-product.

OpenAI ChatGPT

function cartesianCrossProduct(v1, v2) {
    var x = v1.y*v2.z - v1.z*v2.y;
    var y = v1.z*v2.x - v1.x*v2.z;
    var z = v1.x*v2.y - v1.y*v2.x;
    return {x: x, y: y, z: z};
}

OpenAI Codex

function crossProduct(arrays) {
  var result = [];
  var args = Array.prototype.slice.call(arguments);
  var arrays = args.slice(0, args.length - 1);
  var callback = args[args.length - 1];
  var maxLength = Math.max.apply(null, arrays.map(function(arr) { return arr.length; }));
  for (var i = 0; i < maxLength; i++) {
    var item = arrays[0][i];
    var isInAll = arrays.every(function(arr) { return arr.indexOf(item) !== -1; });
    if (isInAll) {
      result.push(item);
    }
  }
  return callback(result);
}

Microsoft Copilot

const crossProduct = (...arrays) => {
    const [first, ...rest] = arrays;
    return first.flatMap(x => rest.map(y => [x, y]));
}

Unfortunately, ChatGPT provided a function for the arithmetic cross-product. In my opinion, it should have provided both types of functions or asked for clarification.

After multiple attempts at different queries, I was able to get ChatGPT to generate all three of the below:

function crossProduct(...arrays) {
    if (arrays.length < 2) {
        throw new Error("At least 2 arrays are required!");
    }

    let result = arrays[0].reduce((acc, x) => {
        return acc.concat(arrays.slice(1).reduce((acc, y) => {
            return acc.concat(y.map(z => [x].concat(z)));
        }, []))
    }, []);

    return result;
}
function allCombinations(...args) {
    var result = [];
    var f = function(prefix, arr) {
        for (var i = 0; i < arr.length; i++) {
            var newPrefix = prefix.concat(arr[i]);
            result.push(newPrefix);
            if (args.length > 0) {
                f(newPrefix, args.shift());
            }
        }
    }
    f([], args[0]);
    return result;
}
function crossProduct(...args) {
    var result = [];
    for(let i = 0; i < args.length; i++){
        for(let j = i + 1; j < args.length; j++){
            for(let ii = 0; ii < args[i].length; ii++){
                for(let jj = 0; jj < args[j].length; jj++){
                    result.push([args[i][ii], args[j][jj]]);
                }
            }
        }
    }
    return result;
}

Copilot

const crossProduct = (...arrays) => {
    const [first, ...rest] = arrays;
    return first.flatMap(x => rest.map(y => [x, y]));
}

As with intersection, Codex provided a function using an old programming style with obscure argument handling.

Those familiar with Cartesian products will know that they can be explosive from both a memory utilization and performance perspective.

Naive implementations will consume sufficient RAM to store all permutations and not return a result until all permutations have been generated (which could be a long time). The ChatGPT and Copilot functions both suffer from these shortcomings.
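For contrast, a lazy implementation can yield one combination at a time, so memory stays flat no matter how many combinations exist. A minimal hand-written generator sketch:

```javascript
function* lazyCartesian(...arrays) {
    // Recursive generator: yields each combination as it is built,
    // never materializing the full product in memory
    function* recurse(index, prefix) {
        if (index === arrays.length) {
            yield prefix;
            return;
        }
        for (const item of arrays[index]) {
            yield* recurse(index + 1, [...prefix, item]);
        }
    }
    yield* recurse(0, []);
}

// Only one combination exists at a time; results stream out immediately
for (const combo of lazyCartesian([1, 2], ["a", "b"])) {
    console.log(combo); // [1,"a"], [1,"b"], [2,"a"], [2,"b"]
}
```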

So, I subsequently asked for a memory efficient JavaScript function for Cartesian product.

ChatGPT

function cartesianProduct(...args) {
    const iterators = args.map(arg => arg[Symbol.iterator]());
    const next = () => {
        const values = iterators.map(iterator => iterator.next().value);
        return values.length > 1 ? values : null;
    };
    return {
        [Symbol.iterator]: () => ({ next })
    };
}

Copilot provided the same results as it did previously.

I assumed the “memory efficient” version from ChatGPT used less RAM than the first one. However, I knew it was not optimal. After several attempts, I arrived at the query: Show me a memory efficient JavaScript function for Cartesian product using a generator.

ChatGPT

function* cartesianProduct(...args) {
    const recursion = (arrays, i = 0) => {
        if (i === arrays.length - 1) {
            yield* arrays[i];
        } else {
            for (let item of arrays[i]) {
                for (let result of recursion(arrays, i + 1)) {
                    yield [item, ...result];
                }
            }
        }
    }
    yield* recursion(args);
}

To be honest, I was pleasantly surprised by this response from ChatGPT. However, attempts to run the code failed because yield [item, ...result] is not inside a generator; it is inside a nested arrow function called recursion.
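For reference, the code can be repaired with two small changes (my own fix, not ChatGPT's): the nested helper must itself be a generator, and the base case must yield single-element arrays so the spread in [item, ...result] has something iterable to work with:

```javascript
function* cartesianProduct(...args) {
    // The helper must be a generator for yield to be legal inside it
    function* recursion(arrays, i = 0) {
        if (i === arrays.length - 1) {
            // Base case: yield arrays, not bare items, so callers can spread
            for (const item of arrays[i]) yield [item];
        } else {
            for (const item of arrays[i]) {
                for (const result of recursion(arrays, i + 1)) {
                    yield [item, ...result];
                }
            }
        }
    }
    yield* recursion(args);
}
```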

I also received a slight variation to which I lost the query:

function* crossProduct(...arrays) {
    if (arrays.length < 2) {
        throw new Error("At least 2 arrays are required!");
    }

    function* recurse(depth, current) {
        if (depth === arrays.length) {
            yield current;
        } else {
            for (let item of arrays[depth]) {
                yield* recurse(depth + 1, current.concat(item));
            }
        }
    }

    yield* recurse(1, [arrays[0]]);
}

Once again, ChatGPT seemed to be making some assumptions about arguments.

Copilot returned the same code as it did previously. It completely ignored my request for a generator.

And now, the real test! Below are benchmark results for speed and memory using code generated by ChatGPT, Codex, and Copilot, plus the fastest, most memory efficient Cartesian product implementations of which I am aware, CXProduct and Lazy Cartesian Product.

Also included is a one-line function that perhaps either ChatGPT or Copilot should have returned (you will see why I included this when you review the results).

//https://stackoverflow.com/questions/12303989/cartesian-product-of-multiple-arrays-in-javascript
const cartesian =
    (...a) => a.reduce((a, b) => a.flatMap(d => b.map(e => [d, e].flat())));

The test takes the Cartesian product of 4 arrays with 10 items each. Hence, the functions should produce a Cartesian product of 10 * 10 * 10 * 10 items, i.e., 10,000.

| Source | # Results | Heap Used | Ops Sec | User CPU | System CPU |
| --- | --- | --- | --- | --- | --- |
| ChatGPT1 | 300 | N/A | N/A | N/A | N/A |
| ChatGPT2 | 50 | N/A | N/A | N/A | N/A |
| ChatGPT3 | 600 | N/A | N/A | N/A | N/A |
| ChatGPT Memory Efficient | N/A | N/A | N/A | N/A | N/A |
| ChatGPT Generator | 1000 | N/A | N/A | N/A | N/A |
| Codex | 4 | N/A | N/A | N/A | N/A |
| Copilot | 30 | N/A | N/A | N/A | N/A |
| One Line | 10000 | 0 | 78.46 | 18150 | 0 |
| CXProduct | 10000 | 0 | 291.50 | 7330 | 0 |
| LazyProduct | 10000 | 17824 | 266.65 | 11560 | 150 |

In summary:

  1. ChatGPT, Codex, and Copilot are incapable of generating code that produces the correct result for a Cartesian product.

  2. ChatGPT sometimes makes assumptions that may not be valid, e.g., requiring 2 arguments.

  3. ChatGPT generated code that it claimed was more memory efficient, but that code would not even run.

  4. The programmers win!

Once again, ChatGPT could be improved by applying its “intelligence” to asking the user a few clarifying questions, e.g., are you looking for speed or memory efficiency, do you want to require at least two arguments? However, before this, it must generate the correct code!

Memoization

Having learned my lesson with intersection and Cartesian products, I immediately asked a refined question for memoization.

Show me an efficient JavaScript function that memoizes object and primitive arguments.

OpenAI ChatGPT

function memoize(fn) {
    let cache = new WeakMap();
    return function(obj, ...args) {
        let key;
        if (obj) {
            if (!cache.has(obj)) {
                cache.set(obj, {});
            }
            key = JSON.stringify([obj, ...args]);
        } else {
            key = JSON.stringify(args);
        }
        if (cache.has(key)) {
            return cache.get(key);
        }
        let result = fn.apply(this, [obj, ...args]);
        cache.set(key, result);
        return result;
    }
}

ChatGPT generated invalid code; a WeakMap must use objects as keys, but the generated code uses stringified keys. The code was modified as follows for testing.

function memoize(fn) {
    let cache = new Map();
    return function(obj, ...args) {
        let key;
        if (obj!=null) {
            if (!cache.has(obj)) {
                cache.set(obj, {});
            }
            key = JSON.stringify([obj, ...args]);
        } else {
            key = JSON.stringify(args);
        }
        if (cache.has(key)) {
            return cache.get(key);
        }
        let result = fn.apply(this, [obj, ...args]);
        cache.set(key, result);
        return result;
    }
}

Open AI Codex

function memoize(func) {
  var cache = {};
  return function() {
    var args = Array.prototype.slice.call(arguments);
    var key = JSON.stringify(args);
    if (cache[key]) {
      return cache[key];
    } else {
      var result = func.apply(null, args);
      cache[key] = result;
      return result;
    }
  }
}

Microsoft Copilot

const memoize = (fn, cache = new Map()) => {
    return (...args) => {
        const key = JSON.stringify(args);
        if (cache.has(key)) {
            return cache.get(key);
        } else {
            const result = fn(...args);
            cache.set(key, result);
            return result;
        }
    }
}

Codex and Copilot generated the same function, which should not be surprising given that Copilot is based on Codex, although in the earlier examples they differed.

ChatGPT, Codex, and Copilot all generated inefficient functions: JSON.stringify followed by a string compare is slow and memory-consumptive at scale.

There are also JavaScript values that do not stringify, e.g., Infinity and NaN. (Unfortunately, the JavaScript JSON spec was defined prior to the age of data science and micro-services, and it was assumed that Infinity and NaN implied errors in code that either were not legitimate or did not need to be transported from one place to another.)
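Both issues are easy to verify:

```javascript
// NaN and Infinity collapse to null under JSON.stringify...
console.log(JSON.stringify([NaN, Infinity, -Infinity])); // '[null,null,null]'

// ...so a stringify-keyed cache cannot distinguish these argument lists
// and would return the cached result for the wrong arguments
console.log(JSON.stringify([NaN]) === JSON.stringify([Infinity])); // true
```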

Now for the proof of efficiency: comparing the ChatGPT, Codex, and Copilot code to nano-memoize and micro-memoize by generating the 12th Fibonacci number using this code:

const fibonacci = (number) => {
  return number < 2 ? number : fibonacci(number - 1) + fibonacci(number - 2);
};
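One testing detail worth noting: for memoization to speed up the inner recursive calls, the recursive reference must point at the memoized wrapper. A sketch using the Copilot memoize from above for illustration:

```javascript
// Copilot's memoize, reproduced so this snippet is self-contained
const memoize = (fn, cache = new Map()) => (...args) => {
    const key = JSON.stringify(args);
    if (cache.has(key)) return cache.get(key);
    const result = fn(...args);
    cache.set(key, result);
    return result;
};

// Declaring with `let` and reassigning makes the recursive calls inside
// the closure resolve to the memoized wrapper, so they hit the cache too
let fibonacci = (number) =>
    number < 2 ? number : fibonacci(number - 1) + fibonacci(number - 2);
fibonacci = memoize(fibonacci);

console.log(fibonacci(12)); // 144
```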

| Source | Heap Used | Ops Sec | User CPU | System CPU |
| --- | --- | --- | --- | --- |
| ChatGPT (Corrected) | 102552 | 45801 | 620 | 0 |
| Codex | 17888 | 52238 | 320 | 0 |
| Copilot | 17888 | 51301 | 320 | 0 |
| nano-memoize | 17576 | 93699 | 470 | 160 |
| micro-memoize | 18872 | 82833 | 620 | 0 |

Nano-memoize is the fastest, at almost twice the speed of the ChatGPT, Codex, and Copilot code, while also using the least memory. Micro-memoize is close on its heels.

Although CPU utilization for nano-memoize and micro-memoize is slightly higher than that of Codex and Copilot, the performance is worth it, and the programmers once again win!

Summary & Hypotheses

Although there is surely value in using both Copilot and ChatGPT for code generation, it should be done with care. Neither will produce optimum code, and in some cases the output will simply be invalid or, worse, incorrect. Additionally, when using ChatGPT, queries should be quite specific.

ChatGPT and Copilot could both be improved by the addition of a feature that would ask clarifying questions.

ChatGPT, if it were truly intelligent, would either tell users to employ its sibling Codex for code generation or would simply use Codex in the background.

If it is using Codex in the background, then I am not sure what is happening when I provide the same function description to both and get different results.

Although I am not familiar with the internal workings of either tool, other than knowing they are language model based, I hypothesize it is unlikely they will get to a point where they can generate optimum code without overcoming this shortcoming:

A system trained on large volumes of publicly unvetted code is going to produce average results for code, i.e., code of average performance and code with an average number of bugs.

To provide consistently accurate results, the system will need:

  1. The ability to consume and use fragments of “counter-sample” data, e.g., that JSON.stringify can be inefficient. The system might gain this capability by analyzing test results as well as code, by being fed known optimum code with some type of weighting, or simply through the critique of results by known experts. Unfortunately, optimum code is often not the most prevalent or most used, so simply feeding the models more examples will not help. Ideally, a truly intelligent system would be able to generate its own test cases.

  2. A deeper, more first-principles “understanding” of programming in order to analyze the code it generates for efficiency flaws, e.g., usually favor iteration over recursion for runtime efficiency, and usually favor recursion for code size and readability.

  3. At a minimum, a code-generating AI should attempt to parse the code it has generated and evaluate its syntactic validity. This should be a simple enhancement to ChatGPT.

  4. Ideally, the AI would also run at least one or two simple test cases to ensure type validity. While I was creating unit tests, Copilot proposed a number of useful enhanced code completions that could be used for this purpose, e.g., arguments to function calls and array lookups. I assume ChatGPT and Codex could be enhanced to do such a thing.
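Point 3 costs almost nothing in JavaScript, since the Function constructor parses source without executing it; a minimal sketch:

```javascript
// Returns null if the source parses cleanly, otherwise the SyntaxError message.
// new Function() only parses the source; it never runs it.
function syntaxError(source) {
    try {
        new Function(source);
        return null;
    } catch (e) {
        return e.message;
    }
}

console.log(syntaxError("const x = 1;"));         // null (valid)
console.log(syntaxError("const x = ;") !== null); // true (invalid)
```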

Resources

I hope you have enjoyed this article. Have a great day, and make yesterday jealous of what you learned today (here or elsewhere)!




Written by anywhichway | Working in the clouds around Seattle on multiple open source projects. Sailing when it is clear.
Published by HackerNoon on 2023/01/26