The Green Build Deceit: Why passing tests are insidious

Written by hackernoon-archives | Published 2017/09/04
Tech Story Tags: tdd | testing | software-development | build | quality

You don’t need to be pernickety to find a build monitor all in green appealing. It’s how humans operate. It’s satisfying and rewarding to have someone — the computer, in this case — tell you that the things you built work. The question is: Do they really?

Even if you practice test-driven development (which you probably don’t), your tests will only ever check the things that you yourself told them to check. That sounds obvious, but it is easy to be deceived.

As far back as the late 1960s, about 30 years before TDD was first introduced, Edsger W. Dijkstra said:

“Testing shows the presence, not the absence of bugs.”

A very simple sentence which, although quite famous, I believe is not taken seriously enough. There is a rather philosophically toned article online that dissects this remark. I don’t want to steal anything from it, so I’d recommend giving it a read if you’re interested.

What I’m trying to get across is: A passing test never means that your software works; it means you didn’t find any problems with it.

Perhaps you remember what happened to GitLab in January 2017 (kudos to GitLab, by the way, for going public with it; very laudable). The biggest problem, in my opinion, was that they didn’t have any working backups, even though they thought they did. If I had to guess, I’d say they had a backup pipeline that reported success every day (or whenever they ran it), but in reality failed silently and never notified anyone.
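If that guess is anywhere near right, the failure probably looked something like this minimal sketch (entirely hypothetical, not GitLab’s actual code): a job that runs the dump but never checks whether it succeeded.

import subprocess

def run_backup():
    # Hypothetical nightly backup job: dump the database to a file
    # and report the outcome.
    result = subprocess.run(
        ["pg_dump", "--file", "/backups/db.sql", "production"])
    # Bug: result.returncode is never checked, so a dump that exits
    # with a non-zero status (bad credentials, full disk, ...) still
    # ends in a cheerful success message in the logs.
    print("backup finished successfully")

if __name__ == "__main__":
    run_backup()

The fix would be a single check=True (or a look at result.returncode), but until somebody actually tries to restore one of those dumps, the job keeps reporting green.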

Whether it’s a failing backup pipeline or a bunch of unit tests for your Python application: a green build on its own is worth close to nothing. What good is being told that everything is fine if it really isn’t? It’s not just unhelpful; it’s insidious. At least a red build tells me that something was wrong.

Let’s look at one simple example: I have a Python function that tells me whether a customer is allowed to buy beer, given their age (yes, I live in Europe):

def is_allowed_to_buy_beer(customer):
    return customer.age >= 16

This piece of code has only two possible outcomes (either the customer is of age or they aren’t), so two tests are enough to exercise both:

def test_of_age(self):
    self.assertEqual(
        is_allowed_to_buy_beer(self.MockCustomer(17)), True)

def test_too_young(self):
    self.assertEqual(
        is_allowed_to_buy_beer(self.MockCustomer(15)), False)

These two tests (I stripped some of the boilerplate) are therefore all I need to reach full code coverage.
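The stripped boilerplate is nothing exotic, by the way: roughly a unittest.TestCase subclass plus a tiny stand-in for the customer. A minimal sketch, assuming the function lives in a module called sellbeer (which is what the coverage report below shows):

import unittest
from collections import namedtuple

from sellbeer import is_allowed_to_buy_beer


class SellBeerTest(unittest.TestCase):
    # All the function under test needs is an object with an age
    # attribute, so a namedtuple is enough of a mock here.
    MockCustomer = namedtuple("MockCustomer", ["age"])

    # ... the two test methods shown above go here ...

Running the suite with coverage enabled confirms the claim: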

python3 -m nose --with-coverage
..
Name          Stmts   Miss  Cover
---------------------------------
sellbeer.py       2      0   100%
---------------------------------
Ran 2 tests in 0.005s

OK

Green tests. 100% coverage. But it doesn’t really prove much. It’s hard to say whether it was worth my time writing 15 lines of test code just to check that I got the comparison operator (≥) the right way round. What if someone calls the function with an object that doesn’t have an age attribute, or with a string, or with None? Sure, maybe I’ll catch that case somewhere else and have a test ready for it there. But what if I don’t? There is an abundance of possibilities that may never cross my mind when I implement an algorithm. The tests, however, will stay green forever.
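To make two of those possibilities concrete, here is what extra tests would reveal about inputs my suite never exercises (illustrative additions on top of the tests above, not part of the suite that produced the green build):

def test_customer_is_none(self):
    # A caller passing None gets an AttributeError instead of an answer.
    with self.assertRaises(AttributeError):
        is_allowed_to_buy_beer(None)

def test_age_is_a_string(self):
    # Python 3 refuses to compare str and int, so an age of "17"
    # blows up with a TypeError.
    with self.assertRaises(TypeError):
        is_allowed_to_buy_beer(self.MockCustomer("17"))

Both of these pass as well, which only underlines the point: the build was green long before anyone decided whether blowing up is actually the behaviour we want.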

I’m certainly not advocating writing no tests at all, nor do I want to trivialise your efforts to get a build green. But it is important to scrutinise what is actually happening.

An engineer’s goal can never be to reach high code coverage or to turn the pie chart on Jenkins all green. It always has to be to build products that do what they are supposed to do.

