Tests vs Checks and the complication of AI

A lot is made in the testing literature of the difference between tests and checks. The distinction seems to have been championed primarily by Michael Bolton and James Bach (with additional background information here), though it has not been without debate. I don’t have anything against the distinction such as it is, but I do think it’s interesting to look at whether it’s really a robust one.

The definitions, just as a starting point, are given in their most recent iteration as:

Testing is the process of evaluating a product by learning about it through exploration and experimentation, which includes to some degree: questioning, study, modeling, observation, inference, etc.

Checking is the process of making evaluations by applying algorithmic decision rules to specific observations of a product.

These come with a host of caveats and clarifications; do go back and read the full explanation if you haven’t already. The difference does not seem intuitive from the words themselves, which may be why there is such debate. Indeed, I’ve never seen anybody other than testing professionals make the distinction, so in normal usage I almost exclusively hear “test” used, and never “check”. Something I might call an automated test, others might call—and insist that it be called—an automated (or machine) check. This is just a consequence of working day-to-day with developers, not with testing nerds who might care about the difference.

Along those lines, I also find it interesting that this statement, still quoting from James Bach’s blog:

One common problem in our industry is that checking is confused with testing. Our purpose here is to reduce that confusion.

goes by with little explanation. There is a clear desire to differentiate what a human can do from what a computer can do. The analogy in the preamble to craftspeople being replaced by factory workers tries to illustrate the problem, but I’m not sure it really works. The factory model also has advantages and requires its own, different, set of skilled workers. I may just be lucky in that I haven’t ever worked in an environment where I was under pressure to blindly automate everything and dismiss the value humans bring to the process, so I’ve never needed the linguistic backing to argue against that. That affords me the privilege to wonder whether this distinction has come about only because of a desire to differentiate between what a computer and a human can do, or because there actually is a fundamental difference.

Obviously, as far as what is possible today, there is no argument. But the more we see AI coming into use in testing, the more difficult this distinction will become. If I have an AI that knows how to use different kinds of apps, and I can give it an app without giving it any specific instructions, what is it doing? Can it ever be testing, or is it by definition only checking? There are AI products being pushed by vendors now that can report differences between builds of an app, though for now these don’t seem to be much more than a glorified diff tool or a monitor for new error messages.

Nonetheless, it’s easy to imagine more and more advanced AIs that can better and better mimic what a real end user (or millions of end users) might do and how they would react. Maybe it can recognize UI elements or simulate the kinds of swipe gestures people make on a phone. Think of the sort of thing I, as a human user, might do when exploring a new app: click everywhere, swipe different ways, move things around, try changing all the settings to see what happens, etc. It’s all that exploration, experimentation, and observation that’s under the “testing” definition above, with some mental model of what I expect from each kind of interaction. I don’t think there’s anything there that an AI fundamentally can’t do, but even then, there would be some kind of report coming out the other end about what the AI found that would have to be evaluated and acted upon by a human. Is the act of running the AI itself the test, and everything else it does just checks? If you’re the type that wants to say that “testing” by its nature can’t be automated, then do you just move the definition of testing to mean interpreting and acting on the results?

This passage addresses something along those lines, and seems to answer “yes”:

This is exactly parallel with the long established convention of distinguishing between “programming” and “compiling.” Programming is what human programmers do. Compiling is what a particular tool does for the programmer, even though what a compiler does might appear to be, technically, exactly what programmers do. Come to think of it, no one speaks of automated programming or manual programming. There is programming, and there is lots of other stuff done by tools. Once a tool is created to do that stuff, it is never called programming again.

Unfortunately “compiling” and “programming” are both a distracting choice of words for me (few of the tools I use involve compiling and the actual programming is the least interesting and least important step in producing software). More charitably, perhaps as more and more coding becomes automated (and it is), “programming” as used here becomes the act of deciding how to use those tools to get to some end result. When thinking about how the application of AI might confuse “tests” vs “checks”, this passage stuck out because it reminded me of another idea I’ve heard which I can only paraphrase: “It’s only called machine learning (or AI) until it works, then it’s just software”. Unfortunately I do not recall who said that or if I am even remembering the point correctly.

More to the point, James also notes in the comments:

Testing involves creative interpretation and analysis, and checking does not

This too seems to be a position that, as AI becomes more advanced and encroaches on areas previously thought to be exclusive to human thought, will be difficult to hold on to. Again, I’m not making the argument that an AI can replace a good tester any time soon, but I do think that sufficiently advanced tools will continue to do more and more of what we previously thought was not possible. Maybe the bar will be so high that expert tester AIs are never in high enough demand to be developed, but could we one day get to the point where the main responsibility a human “tester” has is checking the recommendations of tester AIs?

More likely, I think, the addition of real AIs to testing just means less checking that things work, and more focus on testing whether they actually do the right thing. Until AIs can predict what customers or users want better than the users themselves, we humans should still have plenty to do, but that distinction is a different one than just “test” vs “check”.

Yes, I test in production

Recently a post on Reddit about a company running tests on a live production environment sparked some conversation on the Testers.io Slack channel about whether “testing in production” is a wise idea or not. One Reddit user commented:

Rule number 1 of testing (i.m.o): DO IT ON A NON-PRODUCTION ENVIRONMENT.
Rule number 2 of testing (i.m.o): Make sure you are NOT, I repeat, NOT on a production environment.

Three years ago I might have agreed with that. Today, I absolutely don’t. Am I crazy?

never test in production… for some definitions of production

The original post describes a situation where some medical equipment stopped working overnight. After much debugging and technical support, the cause was identified: the vendor had remotely put the machines into a special testing mode and had not restored them before morning.

There’s unlikely to be anything controversial in saying that this wasn’t a good move on the vendor’s part. They were messing with something a customer was, or could have been, using, without notifying them about it. Though it’s all the more egregious because this was medical equipment, any customer would be annoyed by it when they found out. But you can’t extend that in a blanket way to all kinds of production environments.

There is certainly a lesson to be learned here, but we will get more from it by being more specific. One might suggest any of:

  • Never test something your customer is already using
  • Never test in a way your customer will notice
  • Never test something your customer will notice without telling them

But I can think of counter-examples to each of these, and it boils down to a very simple observation.

If you never test x, you will miss things about x

If you never test in production, you’re robbing yourself of insights about production. You won’t necessarily miss bugs, but unless you have a test environment that mimics every aspect of production perfectly (and none of us do), there will be something that goes on in production that you won’t see.

This is what I didn’t understand four years ago when I started in this line of work. In my first testing job we didn’t test in production, so why would anyone ever need to? The naïveté of a newbie tester, eh?

Over time we got bitten by this in many different ways. The most common category of issues was use cases happening in production that we simply hadn’t anticipated. Of course, some of these you might find earlier by using production data in a test environment, but you will always be limited by your sample. A second category of issues would then arise around faulty assumptions in your test environment. It’s those little details that “shouldn’t affect testing” until they do. If you’re lucky you spot an error right away. If you’re less lucky you push a feature that quietly does nothing until somebody notices it isn’t there. If you’re really having a bad day it silently does the wrong thing.

It’s around this time that you start to catch on to the fact that you need to test new deployments, at the very least to verify that something is working “in the real world”. At this point you’re testing, to some degree, in production. Are we as far along as turning off medical equipment a customer is using? No, but we are already bending the “never test in production” rule.

This is just the beginning of a long list of reasons; iAmALittleTester has compiled a list of many similar scenarios where testing in production can provide information that you aren’t getting otherwise. All of these, though, count on the fact that you aren’t going to break anything by testing. (Maybe this is the crux: if you think that the job of a software tester is to break the software, then you likely think “testing in production” is synonymous with “breaking production”!)

What can you do safely?

One of the key differences between testing in a web or back-end server environment and hardware owned by a customer is that I can usually send requests to a server and examine the response without impacting any other requests hitting the same server. I probably can’t do that with customer hardware. The parallel would be that you probably shouldn’t use a customer’s account for your tests. You need to be careful about any state or statistics a system might be keeping track of. (Though I would raise an eyebrow at anything that needs to be aware that it is being tested. If you have “test mode” in your code, you’ve just created a mini test environment in a live program. Are you gaining any advantages from being in production in that case?)
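To make that concrete, here’s a minimal sketch of what I’d consider a low-risk production check: a single anonymous, read-only request with an assertion on the response. The endpoint, URL, and fields are hypothetical stand-ins, not any real service.

```python
# A minimal sketch of a low-risk production check: one anonymous, read-only
# request against a hypothetical public status endpoint. The URL and the
# expected fields are made up for illustration.
import requests

PROD_URL = "https://example.com/api/status"  # hypothetical endpoint


def check_production_status():
    # One extra GET is indistinguishable from normal user traffic and
    # touches no customer account or server-side state.
    response = requests.get(PROD_URL, timeout=5)
    response.raise_for_status()

    payload = response.json()
    # Only assert on things an anonymous user could observe anyway.
    assert payload.get("status") == "ok", f"Unexpected status: {payload}"
    return payload


if __name__ == "__main__":
    print(check_production_status())
```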

Whether this works, and the types of things you can do, will depend on the issues you expect to find. I’m not going to try stress testing my production environment during peak traffic. If I suspect that a certain kind of request will corrupt the state of the server, I’m certainly not going to do that in production either. If my test has any chance of having a negative effect on a user, I’m not going to risk that. But on a web app, one more anonymous request should be no different from what “real” users are sending the app. And on the subject of “real” users…

I’m not just a QA, I’m also a user

This is the aspect that I’ve found most useful about testing a production environment. Much like experience is often best gained by doing, knowledge of how a product works may be best gained by using it. If you only use a product in a test environment, then you only know how your product works in a test environment. There are lots of insights about a product that can’t come from simply using it, of course, and in some cases it isn’t realistic to expect to be able to use a product as much as the intended users do. But if it is possible, or if you can make it possible, then it is an opportunity to see things in a different way.

When I do get to use something I’m working on like an end-user does, on some level I’m always testing it. It’s not a big leap to say that your end users are, on some level, testing your product every time they use it. If your users are testing in production, why aren’t you?


Addendum: Rosie Sherry pulled together some resources on testing in production over at the MoT Club that are definitely worth checking out. Chaos Monkey from Netflix is one really interesting way of testing your production setup that I meant to mention in this post, but was only reminded of again when I came across this thread.

How do you know you actually did something?

When I was defending my thesis, one of the most memorable questions I was asked was:

“How do you know you actually did something?”

It was not only an important question about my work at the time, but has also been a very useful question for me in at least two ways. In a practical sense, it comes up as something like “did we actually fix that bug?” More foundationally, it can be as simple as “what is this test for?”

Did it actually work?

The original context was a discussion about the efforts I had gone through to remove sources of noise in my data. As I talked about in my previous post, I was using radio telescopes to measure the hydrogen distribution in the early universe. It was a very difficult measurement because even in the most optimistic models the signal was expected to be several orders of magnitude dimmer than everything else the telescopes could see. Not only did we have to filter out radio emission from things in the sky we weren’t interested in, but, unsurprisingly, there is also a whole lot of radio coming from Earth. Although we weren’t actually looking in the FM band, it would be a lot like trying to pick out some faint static in the background of the local classic rock radio station.

One of the reasons these telescopes were built in rural India was that there was relatively little radio in use in the area. Some of it we couldn’t do anything about, but it turned out that a fair amount of the radio noise in the area was accidental. The most memorable example was a stray piece of wire that had somehow been caught on some high-voltage power lines and was essentially acting like an antenna, leaking power from the lines and broadcasting it across the countryside.

We developed a way of using the telescopes to scan the horizon for bright sources of radio signals on the ground and triangulate their location. I spent a surprising amount of time wandering through the countryside with a GPS device in one hand and a radio antenna in the other trying to find these sources. This is what led to what has since become the defining photo of my graduate school experience:

Standing in a field with a radio antenna, next to a cow
Cows don’t care, they have work to do.

Getting back to the question at hand: after spending weeks wandering through fields tightening loose connections, wrapping things in radio shielding, and getting the local power company to clear stray wires off their transmission lines… did we actually fix anything? Did we reduce the noise in our data? Did we make it easier to see the hydrogen signal we were after?

Did we actually fix the bug?

In many ways, getting rid of those errant radio emitters was like removing bugs: noisy little things that were there only accidentally and that we could blame for at least some of the noise in our data.

But these weren’t the bugs that you find before you release. Those are the bugs you find because you anticipated that they would come up and planned your testing around them. These things, in contrast, were already out in the field. They were the kinds of bugs that come from a user report, or from something weird someone notices in the logs. You don’t know what caused them, you didn’t uncover them in testing, and you’re not sure at first what is triggering them. These are the ones that are met only with a hunch, an idea that “this might have to do with X” and “we should try doing Y and that might fix it.”

But how do you actually know that you’ve fixed it? Ideally you should have a test for it, but coming up with a new test that will catch the regression and seeing it pass isn’t enough. The test needs to fail first. If it doesn’t, you can’t show that it’s your fix that makes it pass. If you aren’t able to see the issue in the first place, it doesn’t tell you anything if you make a fix and then still don’t see the issue.
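As a sketch of that workflow (with a made-up function and a made-up bug, since the real ones don’t matter), here’s roughly what “watch it fail first” looks like in something like pytest:

```python
# Hypothetical example: suppose parse_price() crashed on values that use a
# thousands separator, and a fix is proposed that strips the separator.
def parse_price(text: str) -> float:
    # Unfixed version: float("1,299.00") raises ValueError.
    # Proposed fix (applied later): text = text.replace(",", "")
    return float(text)


def test_parse_price_with_thousands_separator():
    # Run this against the unfixed code first and watch it fail.
    # Only then does a later passing run show that the fix did something;
    # if it passes before the fix, the test never reproduced the bug.
    assert parse_price("1,299.00") == 1299.00
```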

For us in the field, the equivalent of reproducing the bug was going out with an antenna, pointing it at what we thought was a source, and hearing static on a handheld radio. The first step after fixing it (or after the power company told us they fixed it) was to go out with the same antenna and see whether the noise had gone away. The next step was turning on the antennas and measuring the noise again; push the fix to production and see what happens.

What is this test for?

Where this can go wrong — whether you know there’s a bug there or not — is when you have a test that doesn’t actually test anything useful. The classic example is an automated test that doesn’t actually check anything, but it can just as easily be the test that checks the wrong thing, the test that doesn’t check what it claims, or even the test that doesn’t check anything different from another one.

To me this is just like asking “did you actually do something?”, because running tests that don’t check anything useful doesn’t do anything useful. If your tests don’t fail when there are bugs, then your tests don’t work.
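A tiny, hypothetical illustration (the function and the values are made up): the first test runs the code but asserts nothing, so it can never fail; only the second one can.

```python
def apply_discount(price: float, percent: float) -> float:
    return price * (1 - percent / 100)


def test_discount_runs_but_checks_nothing():
    # Executes the code and always passes, no matter how broken it is.
    apply_discount(100.0, 20.0)


def test_discount_checks_the_result():
    # Can actually fail when the calculation is wrong.
    assert apply_discount(100.0, 20.0) == 80.0
```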

In a world where time is limited and context is king, whether you can articulate what a test is for can be a useful heuristic for deciding whether or not something is worth doing. It’s much trickier than knowing whether you fixed a specific bug, though. We could go out into the field and hear clearly on the radio that a noise source had been fixed, but it was much harder to answer whether it paid off for our research project overall. Was it worth investing more time into it or not?

How do you know whether a test is worth doing?

How I got into testing

In my first post, I talked a bit about why I decided to start this blog. I often get asked how I ended up in testing given my previous career seems so different, so I thought I would step back a few years and talk about what made testing such a good fit for me.

Before my first job in software testing, this is where I used to work:

The Giant Metrewave Radio Telescope
Or at least, that’s where I worked a few weeks out of the year while I was collecting data for my research. Before software testing, I was an astrophysicist.

My research involved using the Giant Metrewave Radio Telescope — three antennas of which are pictured above — to study the distribution of hydrogen gas billions of years ago. I was trying to study the universe’s transition from the “Dark Ages” before the first stars formed to the age of light that we know today. Though I didn’t know that what I was doing had anything to do with software testing (or even that “software testing” was its own thing), this is where I was honing the skills that I would need when I changed careers. There are two major reasons for that.

To completely oversimplify, the first reason was that I spent a lot of time dealing with really buggy software.

Debugging data pipelines

At the end of the day we were trying to measure one number that nobody had ever measured before using methods nobody had ever tried. That’s what science is all about! What this meant on a practical level was that we had to figure out a way of recording data and processing it using almost entirely custom software. There were packages to do all the fundamental math for us, and the underlying scientific theory was well understood, but it was up to us to build the pipeline that would turn voltages on those radio antennas into the single temperature measurement we wanted.

With custom software, of course, comes custom bugs.

A lot of the code was already established by the research group before I took over, so I basically became the product owner and sole developer of a legacy system without any documentation (not even comments) on day one, and was tasked with extending it into new science without any guarantee that it actually worked in the first place. And believe me, it didn’t. I had signed up for an astrophysics program, but here I was learning how to debug Fortran.

I never got as far as writing explicit “tests”, but I certainly did have to test everything. Made a change to the code? Run the data through again and see if it comes out the same. Getting a weird result? Put through some simple data and see if something sensible comes out. Your 6-day long data reduction pipeline is crashing halfway through one out of every ten times? Requisition some nodes on the computing cluster, learn how to run a debugger, and better hope you don’t have anything else to do for the next week. If I didn’t find and fix the bugs, my research would either be meaningless or take unreasonably long to complete.
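At the time this was all ad hoc, but the “run the data through again and see if it comes out the same” check is easy to sketch. Something like the following, with placeholder file names, is roughly the idea: compare the pipeline’s new output against a saved baseline and flag anything that drifted.

```python
# A sketch of a baseline comparison for pipeline output. The file names are
# placeholders for whatever the real pipeline produces.
import numpy as np


def outputs_match(new_file: str, baseline_file: str, rtol: float = 1e-6) -> bool:
    new = np.loadtxt(new_file)
    baseline = np.loadtxt(baseline_file)
    if new.shape != baseline.shape:
        return False
    # allclose flags any element that drifted beyond the tolerance, which is
    # usually the first hint that a "harmless" change wasn't.
    return np.allclose(new, baseline, rtol=rtol)


if __name__ == "__main__":
    print(outputs_match("reduced_data_new.txt", "reduced_data_baseline.txt"))
```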

The second reason this experience set me up well for testing was that testing and science, believe it or not, are both all about asking questions and running experiments to find the answers.

Experiments are tests

I got into science because I wanted to know more about how the world worked. As a kid, I loved learning why prisms made rainbows and what made the pitch of a race car engine change as it drove by. Can you put the rainbow back in and get white light back out? What happens if the light hitting the prism isn’t white? How fast does the car have to go to break the sound barrier? What if the temperature of the air changes? What happens if the car goes faster than light? The questions got more complicated as I got more degrees under my belt, but the motivation was the same. What happens if we orient the telescopes differently? Or point at a different patch of sky? Get data at a different time of day? Add this new step to the data processing? How about visualizing the data between two steps?

When I left academia, the first company that hired me actually brought me on as a data engineer, since I had experience dealing with hundreds of terabytes at a time. The transition from “scientist” to “data scientist” seemed like it should be easy. But within the first week of training, I had asked so many questions and poked at their systems from so many different directions that they asked if I would consider switching to the test team. I didn’t see anything special about the way I was exploring their system and thinking up new scenarios to try, but they saw someone who knew how to test. What happens if you turn this switch off? What if I set multiple values for this one? What if I start putting things into these columns that you left out of the training notes? What if these two inputs disagree with each other? Why does the system let me input inconsistent data at all?

I may not have learned how to ask those questions because of my experience in science, but that’s the kind of thinking that you need both in a good scientist and in a good tester. I didn’t really know a thing about software engineering, but with years of teaching myself how to code and debug unfamiliar software, I was ready to give it a shot.

Without knowing it, I had always been a tester. The only thing that really changed was that now I was testing software.

Introduction about calling your shots

Oh good, yet another blog about software testing.

I want to start by explaining why I decided to start blogging about an area of work that seems to have no shortage of blogs and communities and podcasts and companies all clamoring for attention. As a relative newbie on the scene, how much new material can I really add to an already very active conversation?

I came into testing as a profession in 2014, just by being in the right place at the right time. It wasn’t something I planned on doing, nor something I had any training in. I’ve taken precisely one computer science course, 10 years ago. In the meantime, I had been pursuing an academic career in physics. I’ve had a lot of catching up to do.

In science, you can go to any one of a hundred introductory textbooks and start learning the same fundamentals. There are books on electromagnetism that all have Maxwell’s equations and there are classes on quantum mechanics that all talk about Hamiltonians. There’s really only one “physics” until you get to the bleeding edge of it, and even then there’s an underlying assumption that even where there are different competing ideas, they’ll eventually converge on the same truth. We all agree that we’re studying the same universe.

Software testing is nothing like that.

We all do software testing, but none of us are testing the same software. Even though we use a lot of the same terms, there are as many ideas about what they mean as there are testers using them. There’s a vast array of different ideas about just about every aspect of what we do. That’s part of what makes it exciting! But it also makes it difficult to feel like I know what I’m doing. How do I actually learn about a discipline that has so much information in so many different places with so many different perspectives without just completely overwhelming myself?

That’s where curling comes in.

Curling rocks in play
Photo by Benson Kua

In case you didn’t already know that I’m Canadian, I’m also a curler. In a lot of ways, curling is physics-as-sport. And what does it have to do with blogging or testing?

Curling is all about sliding rocks down over 100 feet (30 meters) of ice and having them land in the right place. The two biggest variables are simply the direction and how fast you throw it. Once it’s out of the thrower’s hands, it’s the job of the sweepers to tell if it’s going the right speed to stop in the right place or not, and the job of the skip at the far end of the ice to watch if it’s going in the right direction. They need to communicate, since if either one of those variables is off, the sweepers can brush the ice to affect where the rock goes.


As the guy who’s walking down the ice trying to guess where this thing is going to land 100 feet from now, I can tell you it’s damn hard to get that right. When I first started playing, it was very easy to escort 47 rocks down the ice and still not have any idea where the 48th was going to land. That changed when I got a very simple, but oddly frightening, piece of advice.

Just commit to something.

Experienced curlers have a system for communicating how fast a rock is moving by shouting a number from 1 to 10. A “four” means it’s going to stop at the top of the outermost ring. A “seven” means it’ll be right on the button (a bulls-eye, so to speak). I knew this system and I would think about the numbers in my head as I walked beside those rocks, but it wasn’t until I started committing to specific numbers by calling them out to my team that I started to actually get it.

What made it frightening was that those first few times I called out a number I was way off. And I knew that I was going to be way off. I knew I stood a good chance of being wrong, loudly, in front of everybody else on the ice. But by doing it, I actually started to see how the end result compared to what I committed to. Not in the wishy-washy way I did when I would run through those numbers in my head (“that’s about what I would have guessed”), but in a concrete way. It’s similar to how you think you know all the answers when watching Jeopardy, but it’s a lot harder when you have to say the answers out loud. I started to think through the numbers more, pay more attention to how the rocks were moving, commit out loud to something, and take in the feedback to learn something.

Can you see where I’m going with this now?

Even though software testing blogs are a dime a dozen, if I want to actually become an expert in this field I think it’s time to start forcing myself to get my thoughts together and commit to something.

My goal with this blog, then, is to think through testing concepts and my experiences and commit those thoughts to paper. I’m not going to try to explain basic terms as if I’m an authority, but I might try to talk through whether some of those concepts are useful to me or not and how I see them actually being used. I plan to talk about my experiences and views as a tester, as “a QA”, as a developer, and as a member of this community, so that I can commit to growing as a professional.

If nobody else reads this it’ll still be a useful exercise for myself, but I do hope that there’s occasionally a skip on the other end of the ice who’ll hear my “IT’S A TWO!” and shout back “OBVIOUSLY TEN!”