Code Review: It's Bad, Expensive, and Ineffective [in most cases]

I am reviewing a practice we’re all aware of: code review.

In this post, I will try to investigate the rationality of “blocking mode CR” — a CR where code is not deployed until reviewed.

CR used to find defects

One of the first goals that CR is expected to achieve is usually stated as «find defects».

Usually, CR is used in «blocking» or «sync» way whereas the code is not made available to the client until it gets reviewed.

There is an interesting history of this tool.

In the 60th software deployment and verification cycle was quite long and computer time was more expensive than developers’ time.

The cycle consisted of the following steps:

typing code using a typewriter
passing the papers with typed code to the operators
operators punching cards with what they read (transferring text to cards)
loading cards to the computer (usually at night)
running the code
inspecting the test computation results
if anything goes wrong, loading previous software version’s cards to the computer and rejecting the new version — moving it back to development

You can easily guess why the CR instrument appeared, which stage it was added to, and why it was very cost-effective in a blocking way.

CR was done by a group of developers and yet it still was cost-effective as defects’ find-and-fix cycle was extremely long and expensive.

Problematic sociodynamics that I highlight below was not that important as software developers’ time was cheap and the demand for developers’ labor was low.

Decades passed, computers’ computation time became extremely cheap, defects’ find-and-fix cycle became very short in most cases while software developers’ time was becoming more and more expensive.

The question of CR’s economical efficiency for defects’ detection started to arise around 80th when many companies shifted from group CR to mini-group CR or even peer CR (1-to-1).

In 1996 A. Porter, H. Siy, и L. Votta published their work A review of software inspections where they provided CR approaches taxonomy and analyzed their economical efficiency.

These were the conclusions:

At the global level we see that software inspection is still an effective method for detecting and removing defects. However, whether it is cost-effective remains to be seen.

By the end of the last millennia, there were only a few areas where the loading code -> running code -> checking results cycle was more expensive than developers’ time spent on CR.

On the other side more and more automated defects detecting tools appeared:

Compilers, interpreters, and debuggers were providing more and more information
Linters and static analyzers were improving
Unit, integrational and functional tests were getting faster and faster
Pair programming practices were becoming more popular

It seems that by 2000 using CR to detect defects had almost no economical justification but was still used widely.

My hypothesis is that this using a tool which is not effective was due to the unprecedented growth the industry witnessed: companies grew blazingly fast, developers got promoted to managerial roles with almost no training so that rational sociotechnical systems and processes design was (and still is) almost a “no-clue”.

Pascarella, Spadini, Palomba, Bruntink, Bachelli in their Information Needs in Contemporary Code Review work are somewhat proving my hypothesis:

Even though studies have shown that modern code review has the potential to support software quality and dependability [17, 39, 41], researchers have also provided strong empirical evidence that the outcome of this process is rather erratic and often unsatisfying or misaligned with the expectations of participants [2, 10, 37].

Also Alberto Bacchelli & Christian Bird in their 2013’s Expectations, Outcomes, and Challenges of Modern Code Review, 2013 work find the following:

Our results show that, although the top motivation driving code reviews is finding defects, the practice and the actual outcomes are less about finding errors than expected: Defect related comments comprise a small proportion and mainly cover small logical low-level issues.

Once again, it seems that now there is pretty much no economical benefit in using CR for finding defects.

If your team expects CR to find defects, I encourage you to check the efficiency of the process.

Calculating the cost

The typical process involving «blocking» CR will look somewhat similar to this:

a snippet of code (usually a feature or some self-contained part of it) is passed to «CR» stage
... code awaits for CR ...
code is reviewed, defects are found (step 4) or not found (step 6)
code is passed back to development
... code awaits fixing ...
code is fixed and passed to CR stage again

This is an optimistic scenario where code gets fixed only once. You can certainly gather stats on the time code spent in each state.

Google calculated the overall time spent:

During the week, 70% of changes are committed less than 24 hours after they are mailed out for an initial review.

Checking the effect

Now take a few code revisions and check how many defects were found during the CR process and formulate how critical they were.

At this stage, it should be obvious if you’re paying the right price for the number of defects you find.

You might also find that you are amongst those who rarely find any defects during CR at all — Jacek Czerwonka, Michaela Greiler, and Jack Tilford at Microsoft found that:

Code reviews do not find as many bugs as you may think

To summarise, you might find that «blocking» CR is economically ineffective as a tool used for finding defects.

The next goal most often cited as one that CR achieves is knowledge sharing.

CR as a knowledge sharing tool

Alberto Bacchelli & Christian Bird in their 2013’s Expectations, Outcomes, and Challenges of Modern Code Review, 2013 work have this:

one of the things that should be happening with code reviews over time is a distribution of knowledge. If you do a code review and did not learn anything about the area and you still do not know anything about the area, then that was not as good code review as it could have been.

...

from a code review you can learn about the different parts you have to touch to implement a certain feature

...

Programmers answering the survey declared knowledge transfer to be their first motivation for code review in 73 (8%) cases, their second in 119 (14%), and their third in 141 (16%).

These findings do show that for some reason developers and managers consider CR to be a proper tool for knowledge sharing.

It seems that if we consider CR as knowledge-sharing tool, we should not even consider using it in “blocking” mode — as knowledge sharing can be done asynchronously.

CR as a process of teaching

What is teaching?

Teaching is a pedagogical (or, in most of our cases, andrological) process of organizing and stimulating students’ activities aimed at understanding and appropriating certain knowledge and skills.

So teaching process essentially is managing the process of learning. The teaching process is bi-directional and interactive, it includes both information transfer and accepting, understanding and appropriating of the transferred information. The process includes activities performed by both teacher and student.

Effective teaching process generally consists of the following steps:

The teacher provides information, gives theoretical background, and guides through practical activities
The student spends time on practicing
The teacher verifies results and highlights mistakes
Student practices again

With CR the process consists of these steps:

The student spends time on practicing
The teacher verifies results and highlights mistakes
Student practices again

Isn’t it obvious that the essential step is missing here?

The teacher never explains why and how things should be done, so the student has to rely on either knowledge from previous occasions when she practiced or has to blindly guess.

Also, the proper teaching process improves students’ skills, so that every next task in the same knowledge domain is done with fewer and fewer mistakes.

If CR was effective as a teaching process, wouldn’t every developer have zero mistakes after some time? If yes, how long have you been using CR and why do you still use it?

If CR highlights multiple mistakes in developers’ work every time, I would suggest you consider spending time on proper teaching.

Or, ditch the CR (for this goal) all together in favor of design review.

Another negative implication CR has here is that developer learns the wrong way first, then the reviewer highlights mistakes and the developer corrects herself. This means that the wrong way of doing is remembered first.

Other negative effects CR has

Negative sociodynamics due to association of work results with self

Every human being associates her work results with herself.

Valuing things we spend effort on doing is innate to our nature.

Whenever a developer receives feedback like “this has to be redone”, she will inevitably associate this feedback with herself.

The more effort a person spent on doing something, the more vulnerable she is to the critics. And while she hadn’t been taught properly how the problem should have been solved, she considers negative feedback even worse.

Internet’s full of posts and articles on how to “solve” this “negativity problem”. Most of these posts talk about how to provide more “positive” feedback.

Microsoft even hired a fully dedicated specialist working on this issue.

However, this all seems palliative treatment, none of these approaches offer to solve the problem properly.

Solving the problem properly would be introducing the teaching or planning phase before the effort on doing the task is spent and a feeling of achievement emerges.

Cost of focus switch

Using CR in blocking mode means that the team is naturally motivated to review the code as soon as possible to get it deployed.

This means that “blocking” CR introduces one more reason for developers to switch their context.

There are multiple publications on how multitasking and switching contexts are ineffective:

There’s good scientific evidence that it’s more effective to work on one single task at a time with as few context switches as possible.