Agile Testing book club: Goals

This is the fourth part in my series highlighting some of the lessons I’m taking from reading Agile Testing by Lisa Crispin and Janet Gregory. Other entries in the series can be found here.

Chapter 5 had a lot of interesting things about metrics and strategy that I’ve taken notes on and have planned to incorporate into my work. I’m working on another post about metrics that will probably draw on some of the things discussed, but before I get into the nitty gritty of metrics I want to stay a bit higher level. The part that I want to focus on now was about having a goal or a purpose.

When you’re trying to figure out what to measure, first understand what problem you’re trying to solve.
— Agile Testing by Lisa Crispin and Janet Gregory, Chapter 5

This is very much connected to other ideas I’ve talked about before, like letting the team feel pain; pain provides a tangible goal and gives anything you do to address it a clear purpose.

This also put something from Chapter 4 into another context. Lisa and Janet talked about using retrospectives to figure out if a problem could benefit from hiring more testers. I’ve thought about this a few times in my current role: does the team need another tester?

The answer has to come from considering what led you to ask the question in the first place. If your team is limited just by the number of hours in a day and the number of people you have, then you probably want to ask about what skills would be best to add to the team, be it testing or otherwise. If the team is struggling on testing specifically, then another tester may just be a band-aid; could the team be better served by working on building everybody’s testing skills? What problem are you trying to solve? The answer is never “we have the wrong tester to developer ratio.”

It’s also a great grounding question. Last week I wrote about how important it is to identify specifically what you’re trying to test at any moment. The motivation is the same. If you find yourself struggling with too many variables in the air and a hundred contingencies and tangents about all kinds of other things, stop for a moment and ask: “What problem am I trying to solve?” or “What problem do I need to solve right now?”

Qualifying quantitative risk

Let’s start with quantifying qualitative risk first.

Ages ago I was under pressure from some upper management to justify timelines, and I found a lot of advice about using risk as a tool not only to help managers see what they’re getting from the time spent developing a feature (i.e, less risk) but also to help focus what testing you’re doing. This was also coming hand in hand with a push to loosen up our very well defined test process, which came out of very similar pressure. I introduced the concept of a risk assessment matrix as a way of quantifying risk, and it turned out to be a vital tool for the team in planning our sprints.

Five by five

I can’t the original reference I used to base my version from, because if you simply google “risk assessment matrix” you’ll find dozens of links describing the basic concept. The basic concept is this:

Rate the impact (or consequence) of something going wrong on a scale of 1 to 5, with 1 being effectively unnoticeable 5 being catastrophic.  Rate the likelihood (or probability) of something bad happening from 1 to 5, with 1 being very unlikely and 5 being almost certain. Multiply those together and you get a number that represents how risky it is on a scale from 1 to 25.

a 5x5 multiplication table, with low numbers labelled minimal risk and the highest numbers labelled critical risk

How many ambiguities and room for hand waving can you spot already?

Risk is not objective

One of the biggest problems with a system like this is that there’s a lot of room for interpreting what these scales mean. The numbers 1 to 5 are completely arbitrary so we have to attach some meaning to them. Even the Wikipedia article on risk matrices eschews numbers entirely, using instead qualitative markers laid out in a 5×5 look-up table.

The hardest part of this for me and the team was dealing with the fact that neither impact nor probability are the same for everybody. For impact, I used three different scales to illustrate how different people might react based on impact:

To someone working in operations:

  1. Well that’s annoying
  2. This isn’t great but at least it’s manageable
  3. At least things are only broken internally
  4. People are definitely going to notice something is wrong
  5. Everything is on fire!

To our clients:

  1. It’s ok if it doesn’t work, we’re not using it
  2. It works for pretty much everything except…
  3. I guess it’ll do but let’s make it better
  4. This doesn’t really do what I wanted
  5. This isn’t at all what I asked for

And to us, the developers:

  1. Let’s call this a “nice-to-have” and fix it when there’s time
  2. We’ll put this on the roadmap
  3. We’ll bump whatever was next and put it in the next sprint
  4. We need to get someone on this right away
  5. We need to put everything on this right now

You could probably also frame these as performance impact, functional impact, and project impact. Later iterations adjusted the scales a bit and put in more concrete examples; anything that resulted in lost data for a client, for example, was likely to fall into the maximum impact bucket.

Interestingly, in a recent talk Angie Jones extended the basic idea of a 5×5 to include a bunch of other qualities as a way of deciding whether a test is worth automating. In her scheme, she uses “how quickly would this be fixed” as one dimension of the value of a test, whereas I’m folding that into the impact on the development team. I hadn’t seen other variations of the 5×5 matrix when coming up with these scales, and to me the most direct way of making a developer feel the impact of a bug was to ask whether they’d have to work overtime to fix it.

Probability was difficult in its own way as well. We eventually adopted a scale with each bucket mapping to a ballpark percentage chance of a bug being noticed, but even a qualitative scale from “rare” through to “certain” misses a lot of nuance. How do you compare something that will certainly be noticed by only one client to something that low chance of manifesting for every client? I can’t say we ever solidified a good solution to this, but we got used to whatever our de-facto scale was.

How testing factors in

We discussed the ratings we wanting to give each ticket on impact and probability of problems at the beginning of each sprint. These discussions would surface all kinds of potential bugs, known troublesome areas, unanswered questions, and ideas of what kind of testing needed to be done.

Inevitably, when somebody explained their reasoning for assigning a higher impact than someone else by raising a potential defect, someone else would say “oh, but that’s easy to test for.” This was great—everybody’s thinking about testing!—but it also created a tendency to downplay the risk. Since a lower risk item should do with less thorough testing, we might not plan to do the testing required to justify the low risk. Because of that, we added a caveat to our estimates: we estimated what the risk would be if we did no testing beyond, effectively, turning the thing on.

With that in mind, a risk of 1 could mean that one quick manual test would be enough to send it out the door. The rare time something was rated as high as 20 or 25, I would have a litany of reasons sourced from the team as to why we were nervous about it and what we needed to do to mitigate that. That number assigned to “risk” at the end of the day became a useful barometer for whether the amount of testing we planned to do was reasonable.

Beyond testing

Doing this kind of risk assessment had positive effects outside of calibrating our testing. The more integrated testing and development became, the more clear it was that management couldn’t just blame testing for long timelines on some of these features. I deliberately worked this into how I wanted the risk scale to be interpreted, so that it spoke to both design and testing:

Risk  Interpretation
1-4 Minimal: Can always improve later, just test the basics.
5-10 Moderate: Use a solution that works over in-depth studies, test realistic edge cases, and keep estimates lean.
12-16 Serious: Careful design, detailed testing on edges and corners, and detailed estimates on any extra testing beyond the norm.
20-25 Critical: In-depth studies, specialized testing, and conservative estimates.

These boundaries are always fuzzy, of course, and this whole thing has to be evaluated in context. Going back to Angie Jones’s talk, she uses four of these 5×5 grids to get a score out of 100 for whether a test should be automated, and the full range from 25-75 only answers that question with “possibly”. I really like how she uses this kind of system as a comparison against her “gut check”, and my team used this in much the same way.

The end result

Although I did all kinds of fun stuff with comparing these risk estimates against the story points  we put on them, the total time spent on the ticket, and whether we were spending a reasonable ratio of time on test to development, none of that ever saw practical use beyond “hmmm, that’s kind of interesting” or “yeah that ticket went nuts”. Even though I adopted this tool as a way of responding to pressure from management to justify timelines, they (thankfully) rarely ended up asking for these metrics either. Once a ticket was done and out the door, we rarely cared about what our original risk estimate was.

On the other side, however, I credit these (sometimes long) conversations with how smoothly the rest of our sprints would go; everybody not only had a good understanding of what exactly needed to be done and why, but we arrived at that understanding as a group. We quantified risk to put a number into a tracking system, but the qualitative understanding of what that number meant is where the value lay.

The column that shall not be named

Until recently my team had a very simple set up for our sprint board, with just four steps: Tickets would start as “To Do”, move to “In Progress” when somebody started working on it, then put into “Code Review”, and when everything looked good the code was merged and the ticket marked “Done”.

Four columns, labelled To Do, In Progress, Code Review, and Done.

This worked well until we started to have issues with tickets after merging. A separate team was responsible for building a release and deploying it. As we started to put more tickets through we started to find occasional issues at this stage, either just before deploying or soon after. These didn’t go unnoticed for long — we regularly check all our features in production — but they did show that we needed to pay more attention to the work between merging it and deploying it.

The natural inclination of the team was to do what most other teams in the department have already done: insert a “QA” column between Code Review and Done, where we can test the final build for anything unexpected.

The same four columns, with a QA column added between Code Review and Done

I wasn’t on board with this plan. I think it’s too easy to start thinking that QA only happens when a ticket is in that column. Even if everybody says they know that’s not true, the board can reinforce that idea anyway. Following Richard Bradshaw‘s cue, I tried to explain that QA is something that happens, or should happen, throughout development. The problem is, I didn’t have a good alternative name for that new column:

QA spanning across all five columns, with an unnamed column between Code Review and Done

To me, “Testing” is just as limiting as “QA” (should I not do any testing when something is in code review?) “Verification”, the other candidate, to me seemed too narrow for what would go on in that column. I would have been happy with anything but “QA”. But I’m not sure I really made my point—or they took it too literally—because we ended up with this as our board:

The former Testing column as been re-titled The Column That Shall Not Be Named

And from there it was a small step to this:

The Done column has been renamed to Yer a Wizard, Harry

This has had one delightful side-benefit: at our daily stand-ups, instead of asking if a ticket is done yet, we can now legitimately ask: “Are you a wizard yet?”

Unit tests versus the unit tested

I recently read the great and oft-cited article about testing microservice architectures by Cindy Sridharan over on Medium. It’s broadly applicable beyond just “microservices”, so I highly recommend giving it a read. I was struck by this passage in particular:

The main thrust of my argument wasn’t that unit testing is completely obviated by end-to-end tests, but that being able to correctly identifying the “unit” under test might mean accepting that the unit test might resemble what’s traditionally perceived as “integration testing” involving communication over a network.
— Cindy Sridharan, “Testing Microservices, the Sane Way

Articulating what you’re actually trying to test is one of the most underrated skills in testing. It seems like a simple question to ask, but often a forgotten one. Asking “what information does this test give me that no other test does?” is great heuristic for determining whether a test, especially an automated one, is worth keeping. It’s also a convenient go-to when evaluating whether a tester knows what they’re doing, and when trying to understand what a test does.

What interests me about this passage is that Cindy is highlighting how “the thing being tested” gets conflated with the “unit” in “unit testing”.

When I used to train new testers at my company, I would present four levels of testing: unit, component, integration, and system. Invariably, people would ask what the difference between component and integration was. Unit tests were easy because that tended to be the first (and sometimes) only kind of testing incoming devs were familiar with. System tests were easily understood as the ones at the end with everything up and running. But when are you testing a component versus an integration? It’s not obvious, so it’s no surprise that three-tiered descriptions are much more common these days.

For us, testing a component was sometimes a single service. At other times it was testing a well defined system of a larger program (one that probably would have been a microservice if we built the system from scratch). Inputs and outputs were controlled directly, and no other services needed to be running for it to communicate with. We defined a “unit” as the smallest possible thing that could be tested (often a single function) and the “component” was running the actual program.

Is that different from an “integration” test of those units? Not really. Running the component is an integration of smaller units. But it was convenient for us to separate those tests from testing how separate services communicated with each other. You can ask the same question anywhere along the continuum from function to system. Is testing a class a unit test or a component test? Where do you draw the lines?

The confusion highlighted in the passage above, to me, is only because of different definitions of “unit”. If you want to call the thing, the service or behaviour, that you’re testing a “unit” instead of a “component”, go for it. If you want to call the communication between two services the “unit” (as this article does), great. This ambiguity should not be an obstacle to understanding the point: you should know what you’re testing and have a reason for testing it.

And, of course, that you should be testing the most important things. Which means, for example, don’t mock away the database if your service’s main responsibility is writing to the database. The problem isn’t that Cindy is saying “unit tests should be communicating over the network,” though you might read it that way if you’re dogmatic about the term “unit test”. She’s saying “communicating over the network is important, so I’m going to prioritize that communication as the subject (unit) of my tests.”

For all these terms it’s unlikely we’re ever going to setting on unambiguous definitions. Let’s just try to be clear about what we mean when we use them, and clear about what we’re testing with each test.

If you didn’t test it, it doesn’t work

Gary Bernhardt has a great talk online from 2012 called Boundaries, about how design patterns influence, for better and worse, the testing that can be done. In particular he advocates a “core” and “shell” model, having many independent functional cores and one orchestrating shell around them. The idea is that each functional core can be tested independently from the others more effectively by not having to worry about mocks and dependencies. The tests of each core can be greatly simplified, while focusing on what it’s actually supposed to do.

The concept is definitely solid, but I do take issue with one thing he says towards the end. I’ll acknowledge upfront that I’m drawing much more meaning from his choice of words than I think was intended, it just happened to strike a nerve. It happens at about the 29m30s mark, when he’s describing how small the tests for the functional cores became in his example program. As for the shell, the orchestrating code responsible for connecting the wires between these cores, he says:

The shell is actually untested. The only conditional in the shell is the case statement that we saw that was choosing between keys. There’s no other decision made in the entire shell. So I don’t even really feel the need to test it. I fire the program up, I hit ‘j’, as long as that worked, pretty much the whole thing works.

The obvious objection here is that just because the cores work independently doesn’t mean that the shell manipulates those cores correctly. Conditionals aren’t the only things that can have bugs in them. But that’s not my objection. It’s this:

Firing the program up and hitting ‘j’ to see if that works is testing it.

Are there other tests he could do? Sure. At a minimum, ‘j’ isn’t the only keystroke the program supports. So he’s made a risk assessment that the shell is simple enough that this is the only test he needs. It might be an informal assessment, even an unconscious one, but it’s still based on the foundation of unit tests, his knowledge of how the program works, and how it’s going to be used. There’s absolutely nothing wrong with that.

Has he automated his test? No. But is something only a test if it’s automated? Also no. Much like he made a risk assessment that he only needs one test, there’s also a judgement call (perhaps implicit) that this one test isn’t worth automating. There could be any number of reasons for making that call. That doesn’t mean the sequence of firing it up and trying it out is something outside of “testing it”.

I think there is a tendency these days to think that a test only counts if it’s automated, but we should know better than that. It’s been argued that the parts we automate shouldn’t even be called “tests”. Whether or not you agree with the terminology, it has to be acknowledged that testing encompasses a lot more than automating some checks.

I think this same sentiment is what motivates the #NoTesting hashtag on Twitter. If you have a traditional, waterfall style, gated, or heavily regimented picture of what software testing is, you’ll exclude from that umbrella all kinds of other activities that people naturally do to poke at their products. “We don’t want that stuffy bottleneck-creating deadline-breaking time-suck kind of testing around here”, they say, “so therefore we don’t want testing at all.” I bring this up not to put Gary into that camp—in the context of the talk this quibble of mine wouldn’t affect the point he was making—but to point out that people do take this idea too far.

Coincidentally, the same day I came across this talk, I also read The Honest Manual Writer Heuristic by Michael Bolton. It ends with a line that captures this idea that people do a lot more testing than they realize:

Did you notice that I’ve just described testing without using the word “testing”?

If you didn’t test it, it doesn’t work.

or, to make it perfectly clear what I’m getting at, let’s phrase it like this:

If you know it works, you must have tested it.