The Science of Debugging (and Neutrinos)

“Help I’m getting this anomaly”

If your team uses any sort of chat, or you browse any coding-related sites, you’ve probably seen the “Help, I’m getting this error” messages, accompanied by some stack trace, cryptic message, or other evidence of something going wrong.

In the field of neutrino physics, the OPERA experiment did something similar when they tossed a preprint about an “anomaly” up on the arXiv back in 2011. They were observing neutrinos arriving at their detector in Italy roughly 60 nanoseconds faster than the speed of light should allow.

This “anomaly” is as close to a bug report for science as you’re likely to see. They posted the preprint primarily in order to get input from other physicists. Other physicists did offer help, writing over 60 papers about possible causes for the anomaly.

The anomaly disappeared after the experiment fixed a bad connection on a cable. It ended up being something extremely simple (I’m sure many software developers know that experience). And it’s unfortunate that the media hype blew the anomaly way out of proportion with article after article about faster-than-light neutrinos, which ultimately contributed to the project leaders resigning.

Coming from a physics background, I’ve noticed how similar my experience debugging code is to diagnosing an experimental anomaly. I also think about the time other physicists spent writing papers trying to explain what was ultimately something really dumb. Personally, I’d like to be more considerate of my colleagues’ time than that. Toward that end, I’d like to share this way of approaching debugging as a science experiment.

Check the literature

If you pick up a scientific paper and read past the abstract, you’ll notice the first section typically contains a summary of other research, citing other papers. The author has (usually) read these papers and decided that their experiment is sufficiently novel to warrant writing about. The OPERA preprint, for example, mentions other experiments that didn’t observe faster-than-light neutrinos.

With search engines and resources like StackOverflow, I research my software bugs. Sometimes searching for the error will turn up the solution instantly. If that happens, there’s nothing novel and I don’t need to do the experiment myself. If not, playing around with keywords can often tell me if other people have seen similar problems, and help guide my own experiments.

Set up a lab and apparatus

There’s a reason a lot of science goes on in labs: it minimizes external things that can affect whatever is being studied. A lab also contains specialized apparatuses that let them observe things better than they otherwise would be able to. The OPERA experiment is an example. CERN provides a controlled beam, and the detector can detect otherwise elusive neutrinos.

Similarly, I try to set up a “lab” for my code. If the buggy code is part of some larger system, I’ll try to set up a minimal environment that still allows it to exhibit the bug. Things like VMs and virtual environments are great for this. This gives me more control over its context and inputs. To this, I add instruments like debuggers and detailed logging. These let me look at the bug more easily and with greater detail than in the “wild”.

Once the lab’s set up, I need to build the apparatus. This involves bringing in enough of the relevant code and data to reproduce the bug. I also need to set it up to test the things I want to test. For example, if my research (or intuition) suggests a hypothesis that a particular input might be a problem, then I need a straightforward way to manipulate that input in my lab. Fortunately, it’s much easier to rebuild a software apparatus than most physical ones, so I can start quickly with something minimal, and add more controls when hypotheses demand.

Hypothesis, experiment, iterate

This is the heart of science, and also very useful when debugging software. Many small experiments get done to explain an anomaly or solve a bug. OPERA, for example, considered how things like the time-distribution of their neutrino beam could affect their measurement. Similarly, in software, I’ve usually have some hypotheses for what’s causing the bug that I can test.

Using the lab I’ve set up, I run an experiment to test a hypothesis. Maybe my first hypothesis is wrong (and it frequently is). In that case, I move on to the next one. Maybe in testing it, I learn something new about the system, and that leads to a different or more refined hypothesis or a new experiment I can run.

I keep running experiments and testing hypotheses until I understand what’s going on. This is ultimately what science is. We don’t simply do things until they work; we do things to understand what causes something to work. Simply making the error go away isn’t enough. If I don’t understand what caused it, I can’t know if I completely fixed it, or if there are other circumstances under which it might appear. I also don’t know how to avoid coding up the same bug again in the future.

And in some rare cases, I can’t figure it out. Maybe it’s actually a bug in some other system that I can’t experiment on. Maybe it’s something buried deep in code that’s way outside my expertise. In that case, it’s a true anomaly, at least to me.

Publish

In science, the end result of an experiment is a paper. This is a way to share whatever the experiment revealed to the broader community.

After debugging software, I don’t write papers, but I do publish the results in one way or another. For example, if I’ve found the cause of the bug, I might “publish” it by fixing it and putting the fix out for code review. In science, this would be a paper about discovering something. If the bug’s due to some bad assumption I made about the system, I can “publish” this with documentation. In science, the equivalent might be a paper about a new theory.

If it’s one of those bugs that I can’t figure out, then I can “publish” by asking for help. I hope I’ve convinced you that this is actually pretty close to an “anomaly” paper in science. Even the structure of a good bug report is similar: I give context and outline what research I’ve done. Then I explain my experimental apparatus so people can reproduce it. And I explain what experiments I’ve done and things I’ve observed while trying to understand my anomaly. This gives the community the ability to do their own experiments to try to understand it.

Final Remarks

Personally, I find this scientific approach to debugging helpful. Taking an iterative, hypothesis-and-evidence-based approach helps me in terms of actually getting bugs solved. Along the way, I often learn new things. Similarly, science has given us all sorts of practical things, while also expanding our knowledge of the universe.

I don’t mean to imply my approach is the only one, or the best one for all circumstances. There are almost certainly cases where net productivity, at least in the short term, benefits from a fix made without understanding. After all, we can put superconductors to use without fully understanding how they work. But we still experiment and try to learn, and odds are those experiments will pay off eventually.

The OPERA collaboration was actually very thorough in trying to understand their own anomaly before releasing the preprint. Despite that, the project leaders ended up resigning, and many people felt embarrassed (although the hyped-up press coverage deserves much of the blame).

Finally, I also hope the OPERA example reminds you that even smart people can miss stupid bugs. The next time I “publish” a dumb bug report, please don’t be too harsh on me.