Is Test Coverage a Good Metric for Test or Code Quality?

Let the flame wars begin.

Firstly, definitions.

As with all good opinion pieces, I’ll be clear about the terms I’m using and what they mean.

Code Coverage Percentage

The lines of code that are executed when the automated tests run, expressed as a percentage of the entire codebase. For example, 65% code coverage would mean that the tests execute 65% of the code.
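
The arithmetic behind that number is simple. Here is a minimal sketch, assuming coverage is counted in lines. `CoverageMath` and `lineCoveragePercent` are hypothetical names used for illustration, not part of any coverage tool, and real tools differ in exactly what they count.

```java
// Hypothetical helper, not part of any coverage tool: it only shows the arithmetic.
final class CoverageMath {

    /** Returns the share of lines run by at least one test, as a percentage. */
    static double lineCoveragePercent(int coveredLines, int totalLines) {
        if (totalLines <= 0) {
            throw new IllegalArgumentException("totalLines must be positive");
        }
        if (coveredLines < 0 || coveredLines > totalLines) {
            throw new IllegalArgumentException("coveredLines must be between 0 and totalLines");
        }
        return 100.0 * coveredLines / totalLines;
    }

    public static void main(String[] args) {
        // Made-up figures: 1,300 of 2,000 lines were run by at least one test.
        System.out.println(lineCoveragePercent(1300, 2000)); // prints 65.0
    }
}
```

Note what the calculation does not take as input: anything about what the tests actually assert.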

“Good Metric”

For a metric to be “good” in this context, it must have some kind of relationship with code that is higher quality. Here, higher quality means easier to understand, change and maintain.

So what does a high code coverage percentage tell us?

Consider the following example code, which is freakishly simple. It takes an input, strips the first character, and looks the remainder up in a map. The map is hard-coded for simplicity.

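import java.util.HashMap;
import java.util.Map;

public class MyService {

    // Hard-coded lookup table, for simplicity.
    Map<String, String> myMap = new HashMap<>();

    public MyService() {
        myMap.put("1", "v1");
    }

    public String generateValue(String input) {
        String mapValue = parseValue(input);
        return myMap.get(mapValue);
    }

    // Drops the first character of the input; the rest is used as the map key.
    private String parseValue(String input) {
        return input.substring(1);
    }

}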

So what does 100% code coverage look like here?

A simple unit test using Spock might look something like this. This will give us the full 100% test coverage.

def "test my service does its thing properly"() {
    given: 'my service'
    MyService myService = new MyService()

    when: 'I run my service with a value ending in 1'
    String output = myService.generateValue('l1')

    then: 'I expect something back'
    output != null
}

So what do we know?

When we have 100% test coverage we know that a test has run all of the lines in this code.

Cool, so it’s fully tested!

Uh, no. Let’s go and look at my code for a minute. What happens if I pass in null? What if someone messes with the internals of the class to change the output? Would our 100% test coverage tell us anything about that? Hell no. What’s going on here?
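
To make that concrete, here is a minimal Java sketch that pushes a few other inputs through the same, fully covered lines. `CoverageGaps` and `outcome` are hypothetical names for illustration, and `MyService` is copied from above (with package-private visibility) so the sketch runs on its own.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical driver class; the name and structure are illustrative only.
class CoverageGaps {

    /** Runs the call and reports the exception type it throws, or "no exception". */
    static String outcome(Runnable call) {
        try {
            call.run();
            return "no exception";
        } catch (RuntimeException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        MyService service = new MyService();

        // The only input the 100%-coverage test exercises.
        System.out.println(service.generateValue("l1")); // v1

        // Inputs that travel through exactly the same covered lines.
        System.out.println(outcome(() -> service.generateValue(null))); // NullPointerException
        System.out.println(outcome(() -> service.generateValue("")));   // StringIndexOutOfBoundsException
        System.out.println(service.generateValue("x9"));                // null, no such key in the map
    }
}

// Copy of the service from the article, repeated so the sketch is self-contained.
class MyService {
    private final Map<String, String> myMap = new HashMap<>();

    MyService() {
        myMap.put("1", "v1");
    }

    String generateValue(String input) {
        String mapValue = parseValue(input);
        return myMap.get(mapValue);
    }

    private String parseValue(String input) {
        return input.substring(1);
    }
}
```

Every one of those calls runs lines the coverage report already marks as covered, and the existing test says nothing about any of them.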

Code Coverage is a Dumb Metric

Its measurements are reliable when you’re tracking how much of your code is run by your tests, but it tells you absolutely nothing about the value of those tests. Visualising it on its own is useless because it has no reliable, predictive relationship with the quality of the code or the tests. Quality gates that block releases over minor decreases in code coverage are an invitation for crazy shit: tests like this.

def "test my setter works"() {
    given: 'an instance of my domain object'
    MyDomainObject myDomainObject = new MyDomainObject()

    when: 'I set some value on my domain object'
    myDomainObject.setSomeValue('this is a value')

    then: 'I expect that value to be set'
    myDomainObject.getSomeValue() == 'this is a value'
}

Sure, you could test it, but come on. Do you really think your tests should be polluted with this kind of noise? You should be testing your logic, not every single setter in a domain object. Generating that boilerplate is the very reason tools like Lombok exist. This is the problem with being overzealous about code coverage: developers will find a way to make up the numbers. Create a system they don’t believe in and they’ll game it. It’s what they do. This is the definition of waste.

I’m not saying code coverage is a bad metric, it’s just dumb

As part of a chorus of metrics, it can help to paint a picture of what is going on in the codebase. If it’s a flat 0%, that tells you something about the coding standards and the testing practices of a team. Anything above that will not tell you whether those tests are any good or whether the whole testing strategy is broken. That’s really what you care about: high quality code that is thoroughly tested across multiple domains, such as security, performance and resilience.

Let’s take this for a bit of a walk. Here’s a little thought experiment — you have two options.

100% of the lines in your codebase are run by tests. Those tests are like the ones above: they don’t tell you much about what the code should be doing.
50% of the lines in your codebase are executed by tests. Those tests thoroughly test the logic and provide excellent living documentation of the code that they do cover. However, there are some holes in the tests. Some classes are not covered.

I bet your eyes are on the latter. That’s because this thought experiment helps to pull out the real value of a good set of tests in the codebase. They’re supposed to stop bad things from happening and make it easier to continue working on the code. The former does not do that, despite executing all the code. The latter does.
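
Here is a rough Java sketch of that difference, assuming the “bad thing” is someone changing the mapped value inside the service. All names here (`AssertionStrength`, `LookupService`, `weakCheck`, `strictCheck`) and the tampered value are hypothetical. `LookupService` is a stand-in for the service above with a configurable value so the broken change can be simulated.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: names and the tampered value are illustrative only.
class AssertionStrength {

    /** Stand-in for the article's service; the mapped value is configurable
     *  so that a broken change to the internals can be simulated. */
    static final class LookupService {
        private final Map<String, String> myMap = new HashMap<>();

        LookupService(String valueForKeyOne) {
            myMap.put("1", valueForKeyOne);
        }

        String generateValue(String input) {
            return myMap.get(input.substring(1));
        }
    }

    /** The weak check from the full-coverage test: any non-null output passes. */
    static boolean weakCheck(LookupService service) {
        return service.generateValue("l1") != null;
    }

    /** A stricter check that pins down the expected output. */
    static boolean strictCheck(LookupService service) {
        return "v1".equals(service.generateValue("l1"));
    }

    public static void main(String[] args) {
        LookupService original = new LookupService("v1");
        LookupService tampered = new LookupService("oops");

        System.out.println(weakCheck(original));   // true
        System.out.println(strictCheck(original)); // true
        System.out.println(weakCheck(tampered));   // true, the broken change slips through
        System.out.println(strictCheck(tampered)); // false, the broken change is caught
    }
}
```

Both checks execute exactly the same lines, so a coverage report cannot tell them apart. Only one of them notices the change.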

So is code coverage on its own a good indicator of code quality?

Nope, not on its own. The above example shows a simple but common case where code coverage reports full marks but the code looks like it was written by an eight-year-old. Code coverage will give you quantity when what you need is quality. Remember, ten good tests blow 100 garbage tests out of the water, any day of the week.
