How do you know you actually did something?

When I was defending my thesis, one of the most memorable questions I was asked was:

“How do you know you actually did something?”

It was not only an important question about my work at the time, but has also been a very useful question for me in at least two ways. In a practical sense, it comes up as something like “did we actually fix that bug?” More foundationally, it can be as simple as “what is this test for?”

Did it actually work?

The original context was a discussion about the efforts I had gone through to remove sources of noise in my data. As I talked about in my previous post, I was using radio telescopes to measure the hydrogen distribution in the early universe. It was a very difficult measurement because even in the most optimistic models it was expected to be several orders of magnitude dimmer than everything else the telescopes could see. Not only did we have to filter out radio emission from things in the sky we weren’t interested in, but, unsurprisingly, there is also a whole lot of radio coming from Earth. Although we weren’t actually looking in the FM band, it would be a lot like trying to pick out some faint static in the background of the local classic rock radio station.

One of the reasons these telescopes were built in rural India was that there was relatively little radio in use in the area. Some of it we couldn’t do anything about, but it turned out that a fair amount of radio noise in the area was accidental. The most memorable example was a stray piece of wire that had somehow been caught on some high voltage power lines and was essentially acting like an antenna, leaking power from the lines and broadcasting it across the country.

We developed a way of using the telescopes to scan the horizon for bright sources of radio signals on the ground and triangulate their location. I spent a lot of time wandering through the countryside with a GPS device in one hand and a radio antenna in the other, trying to find these sources. This is what led to what has since become the defining photo of my graduate school experience:

Standing in a field with a radio antenna, next to a cow
Cows don’t care, they have work to do.

Getting back to the question at hand, after spending weeks wandering through fields tightening loose connections, wrapping things in radio shielding, and getting the local power company to clean wires off their transmission lines… did we actually fix anything? Did we reduce the noise in our data? Did we make it easier to see the hydrogen signal we were after?

Did we actually fix the bug?

In many ways, getting rid of those errant radio emitters was like removing bugs in data. They were noisy little things that were there only accidentally and that we could blame for at least some of the noise in our data.

But these weren’t the bugs that you find before you release. Those are the bugs you find because you anticipated that they would come up and planned your testing around them. These things, in contrast, were already out in the field. They were the kinds of bugs that come from the user or from something weird someone notices in the logs. You don’t know what caused them, you didn’t uncover them in testing, and you’re not sure at first what is triggering them. These are the ones that are met only with a hunch, an idea that “this might have to do with X” and “we should try doing Y and that might fix it.”

But how do you actually know that you’ve fixed it? Ideally you should have a test for it, but coming up with a new test that will catch a regression and seeing it pass isn’t enough. The test needs to fail first, before the fix is in place. If it doesn’t, you can’t show that fixing the bug is what causes it to pass. If you aren’t able to see the issue in the first place, it doesn’t tell you anything if you make a fix and then still don’t see the issue.
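As a minimal sketch of that idea in Python: the two functions below are made up for illustration (they are not from my actual analysis code), and the "bug" is an assumed one where integer division silently truncates an average. The point is only that the same regression test is run against both versions, and it is the failure on the old code that makes the pass on the new code meaningful.

```python
def mean_buggy(samples):
    # Hypothetical bug: integer division throws away the fractional part.
    return sum(samples) // len(samples)


def mean_fixed(samples):
    # Hypothetical fix: use true division.
    return sum(samples) / len(samples)


def regression_test(mean):
    """Return True if the implementation passes, False if it shows the bug."""
    return mean([1, 2]) == 1.5


# Step 1: reproduce the bug. The test must fail against the old code.
fails_before_fix = not regression_test(mean_buggy)

# Step 2: apply the fix and run the very same test again.
passes_after_fix = regression_test(mean_fixed)

print("fails before fix:", fails_before_fix)
print("passes after fix:", passes_after_fix)
```

If the first step had passed, the second step would tell you nothing, no matter how good the fix was.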

For us in the field, the equivalent of reproducing the bug was going out with an antenna, pointing it at what we thought was a source, and hearing static on a handheld radio. One step after fixing it (or after the power company told us they had fixed it) was to go out with the same antenna and see if the noise had gone away or not. The next step was turning on the antennas and measuring the noise again: push the fix to production and see what happens.

What is this test for?

Where this can go wrong — whether you know there’s a bug there or not — is when you have a test that doesn’t actually test anything useful. The classic example is an automated test that doesn’t actually check anything, but it can just as easily be the test that checks the wrong thing, the test that doesn’t check what it claims, or even the test that doesn’t check anything different from another one.

To me this is just like asking “did you actually do something?”, because running tests that don’t check anything useful doesn’t do anything useful. If your tests don’t fail when there are bugs, then your tests don’t work.
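One rough way to ask “does this test work?” is to run it against a version of the code that you have broken on purpose and see whether it notices. Here is a small Python sketch of that. Everything in it is hypothetical: `clip_noise` is an invented helper that drops samples above a threshold, and the three checks stand in for a test that checks nothing, a test that checks the wrong thing, and a test that checks the behaviour it claims to.

```python
def clip_noise(samples, threshold):
    """Hypothetical helper: drop samples whose magnitude exceeds threshold."""
    return [s for s in samples if abs(s) <= threshold]


def clip_noise_broken(samples, threshold):
    """Deliberately broken variant: drops nothing at all."""
    return list(samples)


def check_nothing(clip):
    # Runs the code but never looks at the result.
    clip([1, 50, -2], threshold=10)
    return True


def check_wrong_thing(clip):
    # Only checks the return type, not whether noise was removed.
    return isinstance(clip([1, 50, -2], threshold=10), list)


def check_behaviour(clip):
    # Checks that the noisy sample (50) is actually gone.
    return clip([1, 50, -2], threshold=10) == [1, -2]


def detects_bug(check):
    """A check is useful here only if it passes on the good code
    and fails on the deliberately broken code."""
    return check(clip_noise) and not check(clip_noise_broken)


results = {
    check.__name__: detects_bug(check)
    for check in (check_nothing, check_wrong_thing, check_behaviour)
}

for name, useful in results.items():
    print(f"{name}: detects the bug = {useful}")
```

Only `check_behaviour` tells the working code apart from the broken code. The other two pass either way, which is another way of saying they didn't actually do anything.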

In a world where time is limited and context is king, whether you can articulate what a test is for can be a useful heuristic for deciding whether or not something is worth doing. It’s much trickier than knowing whether you fixed a specific bug, though. We could go out into the field and hear clearly on the radio that a noise source had been fixed, but it was much harder to answer whether it paid off for our research project overall. Was it worth investing more time into it or not?

How do you know whether a test is worth doing?