Testability as observability and the Accessibility Object Model

I attended a talk today by Rob Dodson on some proposals for the Accessibility Object Model that are trying to add APIs for web developers to more easily manipulate accessibility features of their apps and pages. Rob went through quite a few examples of the benefits the proposed APIs would bring, from simple type checking to defining semantics of custom elements and user actions. Unsurprisingly, the one use case that stuck out for me was making the accessibility layer testable.

It’s commonly cited in accessibility circles that only a small fraction of potential accessibility issues can be checked automatically, typically around 20% or so. Colour contrast, alt tags on images, and page structure, for example. I’ve used the axe-core library for this in my own automation work and it’s been quite useful to flag potential issues. The 80% that can’t be checked as easily represents the wide range of human experiences, abilities, and intentions.

I doubt the proposed AOM APIs would tip that balance very much, but it did look like they’d add a useful standard way to get information about the accessibility properties on an element. This was especially important given the many points at which the semantics of a tag could be defined and how each interacts with the others. The example Rob gave for testability I could see being put into use something like:

const node = getComputedAccessibilityNode(...);

There was a long list of properties that would be exposed on this node, all of which could be checked in unit tests or any other context. This was really the only thing he talked about that directly touched on testing, which made me think: is observability the same as testability?

A quick search online for testability turned up a fair bit about design principles like SOLID, but principles like that seem only to be about making things less complex. I can imagine products that have simple internal architecture but are still intractable as a subject under test. The other half of resources online talk about testability from a scientific standpoint, which leads to falsifiability. I’m going to be completely simplistic for a moment and say that’s what the expect assertion guarantees: falsifiability by answering a true or false question. Whether it’s the right question, well…

I think there’s a good argument to make that observability is one of the most important dimensions of testability. If you can’t see what’s going on with a product, you have no hope of saying anything cogent about it. Better APIs for querying an element certainly contribute to that.

Observability alone might have been all I had to work with as an astrophysicist, but with software we can, and should, do better. Testing is all about investigating what happens in different situations, which requires some kind of control over what is happening and when. The AOM APIs certainly give the developer a lot more control in defining the interactions in the first place. Some of the new semantic events could give a tester a different type of control for simulating user actions, though it sounded like they are still very tentative. The key, for me, is that any fancy custom accessible elements still need to provide ways of poking and prodding them from the context of an automated check to be testable. I not only need to see what happens, but I need to be able to see what happens given whatever arbitrary input my heart desires.
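To make the observability-versus-control distinction concrete, here is a minimal Python sketch. It has nothing to do with the AOM itself, and both function names are hypothetical. The first function is observable (it returns something I can check) but its real input, the clock, is out of my reach; the second takes the time as an argument, so a test can feed it whatever input it likes.

```python
from datetime import datetime


def greeting_uncontrolled():
    """Observable only: the result depends on a clock the test cannot set."""
    return "Good morning" if datetime.now().hour < 12 else "Good afternoon"


def greeting(now):
    """Observable and controllable: the caller decides what time it is."""
    return "Good morning" if now.hour < 12 else "Good afternoon"


# With control over the input, both branches can be checked at any time of day.
print(greeting(datetime(2018, 10, 1, 9, 0)))
print(greeting(datetime(2018, 10, 1, 15, 0)))
```

The uncontrolled version can only ever be checked against whichever branch happens to apply when the test runs.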

Unfortunately, classically, out of the five major motivators for the AOM proposals Rob reviewed, the testability work was explicitly at the bottom of the list. As a tester it’s always a bummer to hear that sort of thing, but I hope it only emphasizes how much of a marginal change it represents over existing ways of probing the accessibility attributes of a page. As long as they remain accessible that way, I can understand it. If, however, the accessibility information that gets hidden in the AOM—touted as one of the perks of the proposals—becomes inaccessible to tests in the meantime, then that’s a problem.

Testability in software = observability + control. What am I missing?

All your automated tests should fail

A crucial little caveat to my statement that automated tests aren’t automated if they don’t run automatically: all your automated tests should fail.

… at least once, anyway.

In the thick of implementing a new feature and coding up a bunch of tests for it, it’s easy to forget to run each one to make sure they’re actually testing something. It’s great to work in an environment where all you have to do is type make test or npm run test and see some green. But if you only ever see green, you’re missing a crucial step.

TDD understands this: red, green, refactor. It’s the first one. You write a test first, see it fail, and then write the code necessary to make it pass. You don’t have to be a die-hard TDD acolyte to do this, though.

Write a test. See it pass. Introduce the bug it was designed to catch. See it fail.

If you can’t make it fail, then what information is that test ever going to give you?

Even if you just reverse the expectation (add a .not or change true to false) and see the case fail, you at least know that you ran the test. This is why a catch-all npm run test is great but dangerous: It is often possible to write a test but not include it in the tests that run automatically, and fail to notice because it’s 1 out of 1000 tests. Quite likely this shouldn’t be possible, but seems to arise naturally in at least three frameworks I’ve worked on. Knowing that all tests passed isn’t the same as knowing that one specific test passed.

Aside from just checking the wrong thing, or failing to run the test at all, this concept can hit in more subtle ways. I recently saw an example while using Jest to test a React app. There were tests written to assert that if a certain condition was met, dispatching the action being tested would result in a rejected promise. Since promises resolve asynchronously, in order to test their results you need to wait until the promises are done before checking the results. In Jest, you do that by returning a promise from the test. These tests weren’t doing that:

it("rejects apples", () => {
  expect(makeOrangeJuice("apples")).rejects.toEqual();
});

By the time the makeOrangeJuice() call was rejected and the expect() failed, the it() function had already finished and declared a pass by default. If the call wasn’t getting rejected (and, spoiler alert, it wasn’t), the test would still pass but you’d see an UnhandledPromiseRejectionWarning in the console if you happened to be watching carefully enough. Worse than that, Jest clears the console logs when it runs more than one suite at a time, so we wouldn’t have seen this error in the CI logs either.

Confusingly, the code was resolving a promise that should have been rejected, but the expect was being rejected (failing) when it should have resolved (pass). So not the most helpful error message unless you’re familiar with this sort of mistake. (“What do you mean unhandled rejection? I’m not rejecting anything! There’s a Promise.resolve() right here!”)

For the record, Jest is pretty good at testing asynchronously, if you remember to write your tests that way. All it takes in most cases is a return:

it("rejects apples", () => {
  return expect(makeOrangeJuice("apples")).rejects.toEqual();
});

That’s a test that should have failed, and actually was failing, but we would never see it fail.
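The same trap exists outside of Jest. Here is a rough Python analogue using asyncio, with a deliberately tiny stand-in for a test runner. Everything in it (make_orange_juice, expect_rejection, run_test) is hypothetical and only meant to show the shape of the mistake: the runner can only wait for what the test hands back to it.

```python
import asyncio


async def make_orange_juice(fruit):
    """Hypothetical buggy function: it should raise for apples, but succeeds."""
    return f"{fruit} juice"


async def expect_rejection(awaitable):
    """Assert that awaiting the given awaitable raises ValueError."""
    try:
        await awaitable
    except ValueError:
        return
    raise AssertionError("expected a rejection, but the call succeeded")


def run_test(test):
    """Tiny stand-in for a runner: it only waits for what the test returns."""
    try:
        outcome = test()
        if asyncio.iscoroutine(outcome):
            asyncio.run(outcome)
    except AssertionError:
        return "fail"
    return "pass"


def test_without_return():
    # The assertion is created but never awaited, so it never actually runs.
    expect_rejection(make_orange_juice("apples"))


def test_with_return():
    # Handing the coroutine back lets the runner wait for the assertion.
    return expect_rejection(make_orange_juice("apples"))


results = {t.__name__: run_test(t) for t in (test_without_return, test_with_return)}
print(results)  # {'test_without_return': 'pass', 'test_with_return': 'fail'}
```

Python also prints a "coroutine ... was never awaited" warning for the first test, which plays the same role as the unhandled rejection warning: easy to miss, and the test still shows green.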

Test your tests. Feed them with bugs so they grow up strong. Love any test that fails. If it can’t, it’s no good to anybody. All your automated tests should fail.

(Bonus tip: you can automate testing your tests! It’s called mutation testing, and though it generally doesn’t tell you if you have useless tests, it does insert bugs and tells you if your test suite as a whole will catch them or not. I have a great little demo for this that I will, one day, make into a video.)
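For a feel of what a mutation testing tool does, here is a hand-rolled sketch in Python. Real tools generate the mutants for you; in this sketch the function, the single mutant, and the two tiny suites are all hypothetical and written by hand.

```python
def is_adult(age):
    """The code under test."""
    return age >= 18


def mutant_is_adult(age):
    """A hand-made mutant: the >= has been changed to >."""
    return age > 18


def weak_suite(fn):
    """Passes for the real code, but never looks at the boundary."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn):
    """Adds the boundary case that the mutant gets wrong."""
    return weak_suite(fn) and fn(18) is True


def survives(suite, mutant):
    """A mutant survives if the suite still passes when run against it."""
    return suite(mutant)


print(survives(weak_suite, mutant_is_adult))    # True: the bug goes unnoticed
print(survives(strong_suite, mutant_is_adult))  # False: the suite catches it
```

A surviving mutant is a bug your suite would let through, which is exactly the "see it fail" step applied to the suite as a whole.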

Highlights from one day at Web Unleashed 2018

I pretended to be a Front-End Developer for the day on Monday and attended some sessions at FITC’s Web Unleashed conference. Here are some of the things I found interesting, paraphrasing my notes from each speaker’s talk.

Responsive Design – Beyond Our Devices

Ethan Marcotte talked about the shift from pages as the central element in web design to patterns. Beware the trap of “as goes the mockup, so goes the markup”. This means designing the priority of content, not just the layout. From there it’s easier to take a device-agnostic approach, where you start with the same baseline layout that works across all devices, and enhance it from there based on the features supported. He wrapped up with a discussion of the value that having a style guide brings, and pointed to a roundup of tools for creating one by Susan Robertson, highlighting that centering on a common design language and naming our patterns helps us understand the intent and use them consistently.

I liked the example of teams struggling with the ambiguity in Atomic Design’s molecules and organisms because I had the same problem the first time I saw it.

Think Like a Hacker – Avoiding Common Web Vulnerabilities

Kristina Balaam reviewed a few common web scripting vulnerabilities. The slides from her talk have demos of each attack on a dummy site. Having worked mostly in the back-end until relatively recently, cross-site scripting is still something I’m learning a lot about, so despite these being quite basic, I admit I spent a good portion of this talk thinking “I wonder if any of my stuff is vulnerable to this.” She pointed to OWASP as a great resource, especially their security audit checklists and code review guidelines. Their site is a bit of a mess to navigate but it’s definitely going into my library.

As a sidenote, her slides say security should “be paramount to QA”, though verbally she said “as paramount as QA”. Either way, this got me thinking about how it fits into customer-first testing, given that it’s often something that’s invisible to the user (until it’s painfully not). There may be a useful distinction there between modern testing as advocating for the user vs having all priorities set by the user, strongly dependent on the nature of that relationship.

Inclusive Form Inputs

Andréa Crofts gave several interesting examples (along with some do’s and don’ts) of how to make forms inclusive. The theme was generally to think of spectra, rather than binaries. Gender is the familiar one here, but something new to me was that you should offer the ability to select multiple options to allow a richer expression of gender identity. Certainly avoid “other” and anything else that creates one path for some people and another for everybody else. She pointed to an article on designing forms for gender by Sabrina Fonseca as a good reference. Also interesting was citizenship (there are many more legal statuses than just “citizen” and “non-citizen”) and the cultural assumptions that are built into the common default security questions. Most importantly: explain why you need the data at all, and talk to people about how to best ask for it. There are more resources on inclusive forms on her website.

Our Human Experience

Haris Mahmood had a bunch of great examples of how our biases creep into our work as developers. Google Translate, for one, treats the Turkish gender-neutral pronoun “o” differently when used with historically male- or female-dominated jobs, just as a result of the learning algorithms being trained on historical texts. The failure of software to recognize people with dark skin was another poignant example. My takeaway: bias in means bias out.

My favourite question of the day came from a woman in the back: “how do you get white tech bros to give a shit?”

Prototyping for Speed & Scale

Finally, Carl Sziebert ran through a ton of design prototyping concepts. Emphasizing the range of possible fidelity in prototypes really helped to show how many different options there are to get fast feedback on our product ideas. Everything from low-fi paper sketches to high-fi user stories implemented in code (to be evaluated by potential users) can help us learn something. The Skeptic’s Guide to Low Fidelity Prototyping by Laura Busche might help convince people to try it, and Taylor Palmer’s list of every UX prototyping tool ever is sure to have anything you need for the fancier stages. (I’m particularly interested to take a closer look at Framer X for my React projects.)

He also talked about prototypes as a way to choose between technology stacks, as a compromise and collaboration tool, a way of identifying “cognitive friction” (repeated clicks and long time between actions to show that something isn’t behaving the way the user expects, for example), and a way of centering design around humans. All aspects that I want to start embracing. His slides have a lot of great visuals to go with these concepts.

Part of the fun of being at a front-end and design-focused conference was seeing how many common themes there are with the conversation happening in the testing space. Carl mentioned the “3-legged stool” metaphor that they use at Google—an engineer, a UX designer, and a PM—that is a clear cousin (at least in spirit if not by heritage) of the classic “3 amigos”—a business person, developer, and tester.

This will all be good fodder for when I lead a UX round-table at the end of the month. You’d be forgiven for forgetting that I’m actually a tester.

99 character talk: Unautomated automated tests

Some of the testers over at testersio.slack.com were lamenting that it was quiet online today, probably due to some kind of special event that the rest of us weren’t so lucky to join. Some joked that we should have our own online conference without them, and Daniel Shaw had the idea for “99 character talks”, an online version of Ministry of Testing’s 99 second talks. Several people put contributions on Slack, and some made it into a thread for them on the MoT Club.

The whole concept reminds me of a story I once heard about Richard Feynman. As I remember it, he was asked the following: If all civilization and scientific knowledge was about to be destroyed, but he could preserve one statement for future generations, what would he say to give them the biggest head start on rebuilding all that lost knowledge? His answer: “Everything is made of atoms.”

Luckily the 99 character challenge wasn’t quite so dramatic as having to preserve the foundations of software testing before an apocalypse. So instead, for my contribution I captured a simpler bit of advice that has bit me in the ass recently:

Automated tests aren’t automated if they don’t run automatically. Don’t make devs ask for feedback.

99 characters exactly.

But because I can’t help myself, here’s a longer version:

The problem with automated tests*, as with any test, is that they can only give you feedback if you actually run them. I have a bunch of tests for a bunch of different aspects of the products I work on, but not all of them are run automatically as part of our build pipelines. This effectively means that they’re opt-in. If someone runs them, actively soliciting the feedback that these tests offer, then they’ll get results. But if nobody takes that active step, the tests do no good at all. One of the worst ways to find a bug, as I recently experienced, is to know that you had everything you needed to catch it in advance but just didn’t.

If your automated tests aren’t run automatically whenever something relevant changes, without requiring someone to manually trigger them, then they aren’t automated at all. An automated test that is run manually is still a manual test. It’s one that can be repeated reliably and probably faster than a human could do, sure, but the onus is still on a fallible bag of meat to run it. Build them into your pipeline and loop the feedback back to the programmer as soon as something goes wrong instead. More and more lately I’m realizing that building reliable and efficient pipelines is a key skill to get the most out of my testing.

Now to put my own advice into practice…

* The “tests vs checks” pedants in the room will have to be satisfied for the moment with the fact that writing “automated checks” would have made the talk 100 characters and therefore inadmissible. Sorry, but there were zero other ways to reduce the character count by 1.

Dynamically create test cases with Robot Framework

In Robot Framework, there isn’t an obvious built-in way to create a list of tests to execute dynamically. I recently faced a case where I wanted to do this, and happily Bryan Oakley (blog, twitter, github) was able to help me through the problem. I’ve seen a few people with similar problems so thought it would be useful to document the solution.

Use the subheadings to skip down to the solution if you don’t want the backstory.

Why would I want to do this

Normally I’m against too much “magic” in test automation. I don’t like to see expected values calculated or constructed with a function that’s just as likely to have bugs as the app being tested, for example. I’ve seen tests with assertions wrapped in for loops that never check whether we actually did greater than zero assertions. Helper functions have an if/else to check two variations of similar behaviour and the test passes, but I can’t tell which of the two cases it thinks it found or whether that was the intended one. When you write a test case you should know what you’re expecting, so expect it. Magic should not be trusted.
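The for-loop problem is easy to reproduce in a few lines of Python. This is a minimal sketch with hypothetical function names: the first check passes vacuously when the list it loops over comes back empty, while the second counts its assertions and refuses to pass if it made none.

```python
def check_all_positive(values):
    """Assertions in a loop: passes without checking anything if values is empty."""
    for value in values:
        assert value > 0


def check_all_positive_guarded(values):
    """Same loop, but fails loudly if it never made a single assertion."""
    checked = 0
    for value in values:
        assert value > 0
        checked += 1
    assert checked > 0, "no items were checked"


check_all_positive([])  # passes, but proves nothing

try:
    check_all_positive_guarded([])
except AssertionError as error:
    print(f"guarded check failed: {error}")
```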

But sometimes I need a little magic.

The problem I had was that I wanted to check that some background code was executing properly every time the user selected an option from a list, but the items in that list could be changed by another team at any time. It wasn’t sufficient to check that one of the items worked, or that a series of fake items did, because I wanted to know that the actual configuration of each item in the real list was consistent with what our code expected. I’m basically testing the integration, but I would summarize it like this: “I want to test that our code properly handles every production use case.”

Importantly, though, I don’t just care that at least one item failed, I care how many items failed and which ones. That’s the difference between looping over every item within a test case and executing a new case for each one. Arguably this is just a reporting problem, and certainly I can drill down into the reports if I did this all with a loop in one test case, but I would rather have the most relevant info front and center.

The standard (unmaintainable) solution

Robot Framework does provide a way of using Test Templates and for-loops to accomplish something like this: given a list, it can run the same test on each item in the list. For 10 items, the report will tell you 10 passed, 10 failed, or somewhere in between. This works well if you know in advance which items you need to test:

*** Settings ***
Test Template    Some test keyword

*** Test Cases ***
:FOR    ${i}    IN RANGE     10
\    ${i}

This runs Some test keyword ten times, using the numbers 0 to 9 as arguments, which you’d define to click on the item index given and make whatever assertions you need to make. Of course as soon as the list changes to 9 or 11 items, this will either fail or silently skip items. To get around this, I added a teardown step to count the number of items in the list and issue a failure if it didn’t match the expected list. Still not great.

The reporting still leaves a bit to be desired, as well. It’s nicer to list out each case with a descriptor, like so:

*** Test Cases ***
Apples     0
Oranges    1
Bananas    2

We get a nice report that tells us that Apples passed but Oranges and Bananas failed. Now I can easily find which thing failed without counting items down the list, but you can see that this is even more of a maintenance nightmare. As soon as the order changes, my report is lying to me.

A failed intermediate option

When I brought this question up to the Robot Framework Slack user group, Bryan suggested I look into using Robot’s visitor model and pre-run modifiers. Immediately this was over my head. Not being a comp-sci person, this was the first I had heard of the visitor pattern, but being someone who always wants to learn, I immediately went down a Wikipedia rabbit hole of new terminology. The basic idea here, as I understand it, is to write a modifier that would change a test suite when it starts. Bryan provided this example:

from robot.api import SuiteVisitor

class MyVisitor(SuiteVisitor):

    def __init__(self):
        pass

    def start_suite(self, suite):
        for i in range(3):
            tc = suite.tests.create(name='Dynamic Test #%s' % i)
            tc.keywords.create(name='Log', args=['Hello from test case #%s' % i])

# to satisfy robot requirement that the class and filename
# are identical
visitor = MyVisitor

This would be saved in a file called “visitor.py”, and then used when executing the suite:

robot --prerunmodifier visitor.py existing_suite.robot

I ran into problems getting this working, and I didn’t like that the pre-run modifier would apply to every suite I was running. This was just one thing I wanted to do among many other tests. I didn’t want to have to isolate this from everything else to be executed in its own job.

My next step to make this more flexible was to adapt this code into a custom python keyword. That way, I could call it from a specific suite setup instead of every suite setup. The basic idea looked like this:

tc = BuiltIn()._context.suite.tests.create(name="new test")

but I couldn’t get past a TypeError being thrown from the first line, even if I was willing to accept the unsupported use of _context. While I was trying to debug that, Bryan suggested a better way.

Solution: Adding test cases with a listener

For this, we’re still going to write a keyword that uses suite.tests.create() to add test cases, but make use of Robot’s listener interface to plug into the suite setup (and avoid _context). Again, this code comes courtesy of Bryan Oakley, though I’ve changed the name of the class:

from __future__ import print_function
from robot.running.model import TestSuite

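class DynamicTestCases(object):
    # Listener API version 3 passes the actual suite objects to listener methods
    ROBOT_LISTENER_API_VERSION = 3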

    def __init__(self):
        self.ROBOT_LIBRARY_LISTENER = self
        self.current_suite = None

    def _start_suite(self, suite, result):
        # save current suite so that we can modify it later
        self.current_suite = suite

    def add_test_case(self, name, kwname, *args):
        """Adds a test case to the current suite

        'name' is the test case name
        'kwname' is the keyword to call
        '*args' are the arguments to pass to the keyword

        Example:
            add_test_case  Example Test Case
            ...  log  hello, world  WARN
        """
        tc = self.current_suite.tests.create(name=name)
        tc.keywords.create(name=kwname, args=args)

# To get our class to load, the module needs to have a class
# with the same name of a module. This makes that happen:
globals()[__name__] = DynamicTestCases

This is how Bryan explained it:

It uses a couple of rarely used robot features. One, it uses listener interface #3, which passes actual objects to the listener methods. Second, it uses this listener as a library, which lets you mix both a listener and keywords in the same file. Listener methods begin with an underscore (eg: `_start_suite`), keywords are normal methods (eg: `add_test_case`). The key is for `start_suite` to save a reference to the current suite. Then, `add_test_case` can use that reference to change the current test case.
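To convince myself of how the saved reference works, the same logic can be exercised without Robot installed at all. This is only a sketch: FakeSuite, FakeTest, FakeKeyword, and FakeItems are hypothetical stand-ins that mimic just the create() calls used above, and the class is a stripped-down copy of the library without the Robot-specific attributes.

```python
class FakeItems(list):
    """Hypothetical stand-in for Robot's tests/keywords collections."""

    def __init__(self, factory):
        super().__init__()
        self._factory = factory

    def create(self, **kwargs):
        item = self._factory(**kwargs)
        self.append(item)
        return item


class FakeKeyword:
    def __init__(self, name, args=()):
        self.name = name
        self.args = tuple(args)


class FakeTest:
    def __init__(self, name):
        self.name = name
        self.keywords = FakeItems(FakeKeyword)


class FakeSuite:
    def __init__(self):
        self.tests = FakeItems(FakeTest)


class DynamicTestCases:
    """Same logic as the library above, minus the Robot-specific attributes."""

    def __init__(self):
        self.current_suite = None

    def _start_suite(self, suite, result):
        # The listener hook: remember which suite is running
        self.current_suite = suite

    def add_test_case(self, name, kwname, *args):
        # The keyword: use the remembered suite to add a test to it
        tc = self.current_suite.tests.create(name=name)
        tc.keywords.create(name=kwname, args=args)


suite = FakeSuite()
library = DynamicTestCases()
library._start_suite(suite, result=None)  # what Robot would call on suite start
for i, label in enumerate(["Apple", "Orange"]):
    library.add_test_case(f"Item {i}: {label}", "Some test keyword", i)

print([test.name for test in suite.tests])  # ['Item 0: Apple', 'Item 1: Orange']
```

The ordering matters: add_test_case only works after _start_suite has run, which is why the keyword has to be called from the suite setup.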

Once this was imported into my test suite as a library, I was able to write a keyword that would define the test cases I needed on suite setup:

Setup one test for each item
    ${numItems}=    Get number of items listed
    :FOR    ${i}    IN RANGE    ${numItems}
    \     Add test case    Item ${i}
    \     ...              Some test keyword    ${i}

The first line of the keyword gets the number of items available (using a custom keyword for brevity), saving us the worry of what happens when the list grows or shrinks; we always test exactly what is listed. The FOR loop then adds one test case to the suite for each item. In the reports, we’ll see the tests listed as “Item 0”, “Item 1”, etc, and each one will execute the keyword Some test keyword with each integer as an argument.

I jazzed this up a bit further:

Setup one test for each item
    ${numItems}=    Get number of items listed
    ${items}=       Get webelements    ${itemXpath}
    :FOR    ${i}    IN RANGE    ${numItems}
    \   ${itemText}=    Set variable
    \   ...             ${items[${i}].get_attribute("text")}
    \   Add test case   Item ${i}: ${itemText}
    \   ...             Some test keyword    ${i}

By getting the text of the WebElement for each item, I can set a more descriptive name. With this, my report will have test cases named “Item 0: Apple”, “Item 1: Orange”, etc. Now the execution report will tell me at a glance how many items failed the test, and which ones, without having to count indices or drill down further to identify the failing item.

The one caveat to this is that Robot will complain if you have a test suite with zero test cases in it, so you still need to define one test case even if it does nothing.

*** Settings ***
Library        DynamicTestCases
Suite setup    Setup one test for each item

*** Test cases ***
Placeholder test
    Log    Placeholder test required by Robot Framework

*** Keywords ***
Setup one test for each item

You cannot, unfortunately, use that dummy test to run the keyword to add the other test cases. By the time we start executing tests, it’s too late to add more to the suite.

Since implementing the DynamicTestCases library, my suite has no longer been plagued with failures caused only by another team doing their job. I’m now testing exactly what is listed at any given moment, no more and no less. My reports actually give me useful numbers on what is happening, and they identify specifically where problems were arising. I still have some safety checks in place on teardown to ensure that I don’t fail to test anything at all, but these have not flagged a problem in weeks.

As long as there’s a good use case for this kind of magic, I hope it is useful to others as well.