All your automated tests should fail

A crucial little caveat to my statement that automated tests aren’t automated if they don’t run automaticallyAll your automated tests should fail.

… at least once, anyway.

In the thick of implementing a new feature and coding up a bunch of tests for it, it’s easy to forget to run each one to make sure they’re actually testing something. It’s great to work in an environment where all you have to do is type make test or npm run test and see some green. But if you only ever see green, you’re missing a crucial step.

99 character talk: Unautomated automated tests

Some of the testers over at testersio.slack.com were lamenting that it was quiet online today, probably due to some kind of special event that the rest of us weren’t so lucky to join. Some joked that we should have our own online conference without them, and Daniel Shaw had the idea for “99 character talks”, an online version of Ministry of Testing’s 99 second talks. Several people put contributions on Slack, and some made it into a thread for them the MoT Club.

The whole concept reminds me of a story I once heard about Richard Feynman. As I remember it, he was asked the following: If all civilization and scientific knowledge was about to be destroyed, but he could preserve one statement for future generations, what would he say to give them the biggest head start on rebuilding all that lost knowledge? His answer: “Everything is made of atoms.

Luckily the 99 character challenge wasn’t quite so dramatic as having to preserve the foundations of software testing before an apocalypse. So instead, for my contribution I captured a simpler bit of advice that has bit me in the ass recently:

Automated tests aren’t automated if they don’t run automatically. Don’t make devs ask for feedback.

99 characters exactly.

But because I can’t help myself, here’s a longer version:

The problem with automated tests*, as with any test, is that they can only give you feedback if you actually run them. I have a bunch of tests for a bunch of different aspects of the products I work on, but not all of them are run automatically as part of our build pipelines. This effectively means that they’re opt-in. If someone runs them, actively soliciting the feedback that these tests offer, then they’ll get results. But if nobody takes that active step, the tests do no good at all. One of the worst ways to find a bug, as I recently experienced, is to know that you had everything you needed to catch it in advance but just didn’t.

If your automated tests aren’t run automatically whenever something relevant changes, without requiring someone to manually trigger them, then they aren’t automated at all. An automated test that is run manually is still a manual test. It’s one that can be repeated reliably and probably faster than a human could do, sure, but the onus is still on a fallible bag of meat to run it. Build them into your pipeline and loop the feedback back to the programmer as soon as something goes wrong instead. More and more lately I’m realizing that building reliable and efficient pipelines is a key skill to get the most out of my testing.

Now to put my own advice into practice…


* The “tests vs checks” pedants in the room will have to be satisfied for the moment with the fact that writing “automated checks” would have make the talk 100 characters and therefore inadmissible. Sorry, but there were zero other ways to reduce the character count by 1.

Dynamically create test cases with Robot Framework

In Robot Framework, there isn’t an obvious built-in way to create a list of tests to execute dynamically. I recently faced a case where I wanted to do this, and happily Bryan Oakley (blog, twitter, github) was able to help me through the problem. I’ve seen a few people with similar problems so thought it would be useful to document the solution.

Use the subheadings to skip down to the solution if you don’t want the backstory.

Why would I want to do this

Normally I’m against too much “magic” in test automation. I don’t like to see expected values calculated or constructed with a function that’s just as likely to have bugs as the app being tested, for example. I’ve seen tests with assertions wrapped in for loops that never check whether we actually did greater than zero assertions. Helper functions have an if/else to check two variations of similar behaviour and the test passes, but I can’t tell which of the two cases it thinks it found or whether that was the intended one. When you write a test case you should know what you’re expecting, so expect it. Magic should not be trusted.

But sometimes I need a little magic.

The problem I had was that I wanted to check that some background code was executing properly every time the user selected an option from a list, but the items in that list could be changed by another team at any time. It wasn’t sufficient to check that one of the items worked, or that a series of fake items, because I wanted to know that the actual configuration of each item in the real list was consistent with what our code expected. I’m basically testing the integration, but I would summarize it like this: “I want to test that our code properly handles every production use case.”

Importantly, though, I don’t just care that at least one item failed, I care how many items failed and which ones. That’s the difference between looping over every item within a test case and executing a new case for each one. Arguably this is just a reporting problem, and certainly I can drill down into the reports if I did this all with a loop in one test case, but I would rather have the most relevant info front and center.

The standard (unmaintainable) solution

Robot Framework does provide a way of using Test Templates and for-loops to accomplish something like this: given a list, it can run the same test on each item in the list. For 10 items, the report will tell you 10 passed, 10 failed, or somewhere in between. This works well if you know in advance which items you need to test:

*** Settings ***
Test Template    Some test keyword

*** Test Cases ***
:FOR    ${i}    IN RANGE     10
\    ${i}

This runs Some test keyword ten times, using the numbers 0 to 9 as arguments, which you’d define to click on the item index given and make whatever assertions you need to make. Of course as soon as the list changes to 9 or 11 items, this will either fail or silently skip items. To get around this, I added a teardown step to count the number of items in the list and issue a failure if it didn’t match the expected list. Still not great.

The reporting still leaves a bit to be desired, as well. It’s nicer to list out each case with a descriptor, like so:

*** Test Cases ***
Apples     0
Oranges    1
Bananas    2

We get a nice report that tells us that Apples passed but Oranges and Bananas failed. Now I can easily find which thing failed without counting items down the list, but you can see that this is even more of a maintenance nightmare. As soon as the order changes, my report is lying to me.

A failed intermediate option

When I brought this question up to the Robot Framework slack user group, Bryan suggested I look into using Robot’s visitor model and pre-run modifiers. Immediately this was over my head. Not being a comp-sci person, this was the first I had heard of the visitor pattern, but being some who always wants to learn this immediately sent me down a Wikipedia rabbit hole of new terminology. The basic idea here, as I understand it, is to write a modifier that would change a test suite when it starts. Bryan provided this example:

from robot.api import SuiteVisitor

class MyVisitor(SuiteVisitor):

    def __init__(self):
        pass
    
    def start_suite(self, suite):
        for i in range(3):
            tc = suite.tests.create(name='Dynamic Test #%s' % i)
            tc.keywords.create(name='Log', args=['Hello from test case #%s' % i])


# to satisfy robot requirement that the class and filename
# are identical
visitor = MyVisitor

This would be saved in a file called “visitor.py”, and then used when executing the suite:

robot --prerunmodifier visitor.py existing_suite.robot

I ran into problems getting this working, and I didn’t like that the pre-run modifier would apply to every suite I was running. This was just one thing I wanted to do among many other tests. I didn’t want to have to isolate this from everything else to be executed in its own job.

My next step to make this more flexible was to adapt this code into a custom python keyword. That way, I could call it from a specific suite setup instead of every suite setup. The basic idea looked like this:

tc = BuiltIn()._context.suite.tests.create(name="new test")
tc.keywords.create(...)

but I couldn’t get past a TypeError being thrown from the first line, even if I was willing to accept the unsupported use of _context. While I was trying to debug that, Bryan suggested a better way.

Solution: Adding test cases with a listener

For this, we’re still going to write a keyword that uses suite.tests.create() to add test cases, but make use of Robot’s listener interface to plug into the suite setup (and avoid _context). Again, this code comes courtesy of Bryan Oakley, though I’ve changed the name of the class:

from __future__ import print_function
from robot.running.model import TestSuite


class DynamicTestCases(object):
    ROBOT_LISTENER_API_VERSION = 3
    ROBOT_LIBRARY_SCOPE = 'TEST SUITE'

    def __init__(self):
        self.ROBOT_LIBRARY_LISTENER = self
        self.current_suite = None

    def _start_suite(self, suite, result):
        # save current suite so that we can modify it later
        self.current_suite = suite

    def add_test_case(self, name, kwname, *args):
        """Adds a test case to the current suite

        'name' is the test case name
        'kwname' is the keyword to call
        '*args' are the arguments to pass to the keyword

        Example:
            add_test_case  Example Test Case  
            ...  log  hello, world  WARN
        """
        tc = self.current_suite.tests.create(name=name)
        tc.keywords.create(name=kwname, args=args)

# To get our class to load, the module needs to have a class
# with the same name of a module. This makes that happen:
globals()[__name__] = DynamicTestCases

This is how Bryan explained it:

It uses a couple of rarely used robot features. One, it uses listener interface #3, which passes actual objects to the listener methods. Second, it uses this listener as a library, which lets you mix both a listener and keywords in the same file. Listener methods begin with an underscore (eg: `_start_suite`), keywords are normal methods (eg: `add_test_case`). The key is for `start_suite` to save a reference to the current suite. Then, `add_test_case` can use that reference to change the current test case.

Once this was imported into my test suite as a library, I was able to write a keyword that would define the test cases I needed on suite setup:

Setup one test for each item
    ${numItems}=    Get number of items listed
    :FOR    ${i}    IN RANGE    ${numItems}
    \     Add test case    Item ${i}
    \     ...              Some test keyword    ${i}

The first line of the keyword gets the number of items available (using a custom keyword for brevity), saving us the worry of what happens when the list grows or shrinks; we always test exactly what is listed. The FOR loop then adds one test case to the suite for each item. In the reports, we’ll see the tests listed as “Item 0”, “Item 1”, etc, and each one will execute the keyword Some test keyword with each integer as an argument.

I jazzed this up a bit further:

Setup one test for each item
    ${numItems}=    Get number of items listed
    ${items}=       Get webelements    ${itemXpath}
    :FOR    ${i}    IN RANGE    ${numItems}
    \   ${itemText}=    Set variable
    \   ...             ${items[${i}].get_attribute("text")}
    \   Add test case   Item ${i}: ${itemText}
    \   ...             Some test keyword ${i}

By getting the text of the WebElement for each item, I can set a more descriptive name. With this, my report will have test cases name “Item 0: Apple”, “Item 1: Orange”, etc. Now the execution report will tell me at a glance how many items failed the test, and which ones, without having to count indices or drill down further to identify the failing item.

The one caveat to this is that Robot will complain if you have a test suite with zero test cases in it, so you still need to define one test cases even if it does nothing.

*** Settings ***
Library        DynamicTestCases
Suite setup    Setup one test for each item

*** Test cases ***
Placeholder test
    Log    Placeholder test required by Robot Framework

*** Keywords ****
Setup one test for each item
    ...

You can not, unfortunately, use that dummy test to run the keyword to add the other test cases. By the time we start executing tests, it’s too late to add more to the suite.

Since implementing the DynamicTestCases library, my suite has no longer been plagued with failures caused only by another team doing their job. I’m now testing exactly what is listed at any given moment, no more and no less. My reports actually give me useful numbers on what is happening, and they identify specifically where problems were arising. I still have some safety checks in place on teardown to ensure that I don’t fail to test anything at all, but these have not flagged a problem in weeks.

As long as there’s a good use case for this kind of magic, I hope it is useful to others as well.

Tests vs Checks and the complication of AI

There’s a lot to be made in testing literature of the differences between tests and checks. This seems to have been championed primarily by Michael Bolton and James Bach (with additional background information here), though it has not been without debate. I don’t have anything against the distinction such as it is, but I do think it’s interesting to look at what whether it’s really a robust one.

The definitions, just as a starting point, are given in their most recent iteration as:

Testing is the process of evaluating a product by learning about it through exploration and experimentation, which includes to some degree: questioning, study, modeling, observation, inference, etc.

Checking is the process of making evaluations by applying algorithmic decision rules to specific observations of a product.

These come with a host of caveats and clarifications; do go back and read the full explanation if you haven’t already. The difference does not seem intuitive from the words themselves, which may be why there is such debate. Indeed, I’ve never seen anybody other than testing professionals make the distinction, so in normal usage I almost exclusively hear “test” used, and never “check”. Something I might call an automated test, others might call—and insist that it be called—an automated (or machine) check. This is just a consequence of working day-to-day with developers, not with testing nerds who might care about the difference.

Along those lines, I also find it interesting that this statement, still quoting from James Bach’s blog:

One common problem in our industry is that checking is confused with testing. Our purpose here is to reduce that confusion.

goes by with little explanation. There is a clear desire to differentiate what a human can do and what a computer can do. The analogy in the preamble to craftspeople being replaced by factory workers tries to illustrate the problem, but I’m not sure it really works. The factory model also has advantages and requires it own, different, set of skilled workers. I may just be lucky in that I haven’t ever worked in an environment where I was under pressure to blindly automate everything and dismiss the value humans bring to the process, so I’ve never needed the linguistic backing to argue against that. This affords some privilege to wonder whether this distinction has come about only because of a desire to differentiate between what a computer and a human can do, or because there actually is a fundamental difference.

Obviously, as far as what is possible today, there is no argument. But the more we see AI coming into use in testing, the more difficult this distinction will become. If I have an AI that knows how to use different kind of apps, and I can give it an app without giving it any specific instructions, what is it doing? Can it ever be testing, or is it by definition only checking? There are AI products being pushed by vendors now that can report differences between builds of an app, though for now these don’t seem to be much more than a glorified diff tool or monitor for new error messages.

Nonetheless, it’s easy to imagine more and more advanced AIs that can better and better mimic what a real end user (or millions of end users) might do and how they would react. Maybe it can recognize UI elements or simulate the kinds of swipe gestures people make on a phone. Think of the sort of thing I, as a human user, might do when exploring a new app: click everywhere, swipe different ways, move things around, try changing all the settings to see what happens, etc. It’s all that exploration, experimentation, and observation that’s under the “testing” definition above, with some mental model of what I expect from each kind of interaction. I don’t think there’s anything there that an AI fundamentally can’t do, but even then, there would be some kind of report coming out the other end about what the AI found that would have to be evaluated and acted upon by a human. Is the act of running the AI itself the test, and every thing else it does just checks? If you’re the type that wants to say that “testing” by its nature can’t be automated, then do you just move the definition of testing to mean interpreting and acting on the results?

This passage addresses something along those lines, and seems to answer “yes”:

This is exactly parallel with the long established convention of distinguishing between “programming” and “compiling.” Programming is what human programmers do. Compiling is what a particular tool does for the programmer, even though what a compiler does might appear to be, technically, exactly what programmers do. Come to think of it, no one speaks of automated programming or manual programming. There is programming, and there is lots of other stuff done by tools. Once a tool is created to do that stuff, it is never called programming again.

Unfortunately “compiling” and “programming” are both a distracting choice of words for me (few of the tools I use involve compiling and the actual programming is the least interesting and least important step in producing software). More charitably, perhaps as more and more coding becomes automated (and it is), “programming” as used here becomes the act of deciding how to use those tools to get to some end result. When thinking about how the application of AI might confuse “tests” vs “checks”, this passage stuck out because it reminded me of another idea I’ve heard which I can only paraphrase: “It’s only called machine learning (or AI) until it works, then it’s just software”. Unfortunately I do not recall who said that or if I am even remembering the point correctly.

More to the point, James also notes in the comments:

Testing involves creative interpretation and analysis, and checking does not

This too seems to be a position that, as AI becomes more advanced and encroaches on areas previously thought to be exclusive to human thought, will be difficult to hold on to. Again, I’m not making the argument that an AI can replace a good tester any time soon, but I do think that sufficiently advanced tools will continue to do more and more of what we previous thought was not possible. Maybe the bar will be so high that expert tester AIs are never in high enough demand to be developed, but could we one day get to the point where the main responsibility a human “tester” has is checking the recommendations of tester AIs?

I think more likely the addition of real AIs to testing just means less checking that things work, and more focus on testing whether they actually do the right thing. Until AIs can predict what customers or users want better than the users themselves, us humans should still have plenty to do, but that distinction is a different one than just “test” vs “check”.