Seven things I learned at CAST

My CAST debrief from last week ended up being mostly about personal reflection, but I also wanted to call out some of the things I picked up that are more specific to testing. After going through my notes I picked out what I thought were the important themes or interesting insights. There are many other tidbits, interactions, and ideas from those notes that I may write about later; these are just the top seven that I think will have an impact right away.

Each section here is paraphrased from the notes I took in each speaker’s session.

1. “Because we don’t like uncertainty we pretend we don’t have it”
Liz Keogh presented a keynote on using Cynefin as a way of embracing uncertainty and chaos in how we work. We are hard-wired to see patterns where they don’t exist. “We treat complexity as if it’s predictable.” People experience the highest stress when shocks are unpredictable, so let’s at least prepare for them. To be safe to fail you have to be able to (1) know if it works, (2) amplify it if it does work, (3) know if it fails, (4) dampen it if it does fail, and (5) have a realistic reason for thinking there will be a positive impact in the first place. It’s not about avoiding failure, it’s about being prepared for it. The linked article above is definitely worth the read.

2. Testers are designers
From John Cutler’s keynote: “Design is the rendering of intent”. Anybody who makes decisions that affect that rendering is a designer. What you test, when you test, who you include, your mental models of the customer, and how you talk about what you find all factor in. UX researchers have to know that they aren’t the user, but often the tester has used the product more than anybody else in the group. There’s no such thing as a design solution, but there is a design rationale. There should be a narrative around the choices we make for our products, and testers provide a ton of the information for that narrative. Without that narrative, no design is happening (nor any testing!) because there’s no sense of the intent.

3. Knowledge goes stale faster than tickets on a board do
From Liz Keogh’s lightning talk: If testers are at their WIP limit and devs can’t push any more work to them, your team will actually go faster if you make the devs go read a book rather than take on new work that will just pile up. You lose more productivity in catching up on work someone has already forgotten about than you gain in “getting ahead”. In practice, of course, you should have the devs help the testers when they’re full and the testers help the devs when they have spare capacity. (By the way, it’s not a Kanban board unless you have a signal for when you can take on new work, like an empty slot in a column. That empty slot is the kanban.)

4. “No matter how it looks at first, it’s always a people problem” – Jerry Weinberg
I don’t recall which session this quote first came up in, but it became a recurring theme in many of the sessions and discussions, especially around coaching testing, communicating, and whole-team testing. Jerry passed away just before the conference started, and though I had never met him he clearly had a strong influence on many of the people there. He’s probably most often cited for his definition that “quality is value to some person”.

5. Pair testing is one of the most powerful techniques we have
Though I don’t think anybody said this explicitly, this was evident to me from how often the concept came up. Lisi Hocke gave a talk about pairing with testers from other companies to improve her own testing while cross-pollinating testing ideas and skills with the wider community. Amit Wertheimer cited pairing with devs as a great way to identify tools and opportunities to make their lives easier. Jose Lima talked about running group exploratory testing sessions and the benefits that brings in learning about the product and coaching testing. Coaching itself, I think, is a form of pairing, so the tutorial with Anne-Marie Charrett contributed to this theme as well. This is something that I need to do more of.

6. BDD is not about testing, it’s about collaboration
From Liz Keogh again: “BDD is an analysis technique, it’s not about testing.” It’s very hard to refactor English, and non-technical people rarely read the Cucumber scenarios anyway. She says to just use a DSL instead. If you’re only implementing Cucumber tests but aren’t having a three-amigos-style conversation with the business about what the scenarios should be, it isn’t BDD. Angie Jones emphasized these same points when introducing Cucumber in her automation tutorial as a caveat that she was only covering the automation part of BDD, not BDD itself. Though I’ve worked in styles that called themselves “behaviour driven”, I’ve never worked with actual “BDD”, and this was the first time I’d heard of it being more than a way of automating test cases.
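To make the “just use a DSL” point concrete, here is a minimal sketch in plain Python. It’s entirely hypothetical (not from either talk): the idea is that the steps are ordinary code you can refactor with your tools, rather than English strings glued to step definitions.

```python
# Hypothetical sketch: a tiny test "DSL" in plain Python instead of Gherkin.
# Steps read like a scenario but are ordinary methods, so they refactor easily.

class CheckoutDsl:
    def __init__(self):
        self.cart = []
        self.order = None

    def given_a_cart_with(self, *items):
        self.cart.extend(items)
        return self  # returning self lets the steps chain like a sentence

    def when_the_customer_checks_out(self):
        self.order = {"items": list(self.cart), "status": "placed"}
        return self

    def then_an_order_is_placed_for(self, *items):
        assert self.order and self.order["items"] == list(items)
        return self


def test_checkout_places_an_order():
    (CheckoutDsl()
        .given_a_cart_with("book", "pen")
        .when_the_customer_checks_out()
        .then_an_order_is_placed_for("book", "pen"))
```

None of which replaces the three-amigos conversation, of course; the scenario still has to come from somewhere.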

7. Want to come up with good test ideas? Don’t read the f*ing manual!
From Paul Holland: Detailed specifications kill creativity. Start with high-level bullet points and brainstorm from there. Share ideas round-robin (think of playing “yes, and”) to build a shared list. Even dumb ideas can trigger good ideas, and even nonsensical ones are worth encouraging since users will do things you didn’t think of. Give yourself time and space to be creative, and only after you’ve come up with every test idea you can should you start looking at the details. John Cleese is a big inspiration here.

Bonus fact: Anybody can stop a rocket launch
[Image: the bright light of a rocket engine taking off in the night]
The “T-minus” countdown for rocket launches isn’t continuous: they pause it at various intervals and reset it back to checkpoints when they have to address something. What does this have to do with testing? Just that the launch of the Parker Solar Probe was the weekend after CAST. At 3:30am on Saturday I sat on the beach with my husband listening to the live feed from NASA as the engineers performed the pre-launch checks: starting, stopping, and resetting the clock as needed. I was struck by the fact that at “T minus 1 minute 55 seconds” one calm comment from one person about one threshold being crossed scrubbed the entire launch without any debate. There wouldn’t be time to rewind to the previous checkpoint at T-minus 4 minutes before the launch window closed, so they shut the whole thing down. I’m sure that there’s an analogy to the whole team owning gates in their CD pipelines in there somewhere!

Qualifying quantitative risk

Let’s start with quantifying qualitative risk first.

Ages ago I was under pressure from some upper management to justify timelines, and I found a lot of advice about using risk as a tool not only to help managers see what they’re getting from the time spent developing a feature (i.e., less risk) but also to help focus what testing you’re doing. This came hand in hand with a push to loosen up our very well-defined test process, which grew out of very similar pressure. I introduced the concept of a risk assessment matrix as a way of quantifying risk, and it turned out to be a vital tool for the team in planning our sprints.

Five by five

I can’t find the original reference I based my version on, because if you simply google “risk assessment matrix” you’ll find dozens of links describing the same idea. The basic concept is this:

Rate the impact (or consequence) of something going wrong on a scale of 1 to 5, with 1 being effectively unnoticeable and 5 being catastrophic. Rate the likelihood (or probability) of something bad happening from 1 to 5, with 1 being very unlikely and 5 being almost certain. Multiply those together and you get a number that represents how risky it is on a scale from 1 to 25.

[Image: a 5×5 multiplication table, with low numbers labelled minimal risk and the highest numbers labelled critical risk]
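If it helps to see the arithmetic spelled out, here is a minimal sketch in Python (the function name is mine, not from any particular standard):

```python
# Minimal sketch of the scoring described above; the function name is mine.

def risk_score(impact: int, likelihood: int) -> int:
    """Risk is just impact (1-5) multiplied by likelihood (1-5)."""
    if not (1 <= impact <= 5 and 1 <= likelihood <= 5):
        raise ValueError("impact and likelihood must each be between 1 and 5")
    return impact * likelihood

# Printing the grid reproduces the multiplication table from the image:
for impact in range(5, 0, -1):
    print(" ".join(f"{risk_score(impact, likelihood):2d}"
                   for likelihood in range(1, 6)))
```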

How many ambiguities and how much room for hand-waving can you spot already?

Risk is not objective

One of the biggest problems with a system like this is that there’s a lot of room for interpreting what these scales mean. The numbers 1 to 5 are completely arbitrary so we have to attach some meaning to them. Even the Wikipedia article on risk matrices eschews numbers entirely, using instead qualitative markers laid out in a 5×5 look-up table.

The hardest part of this for me and the team was dealing with the fact that neither impact nor probability is the same for everybody. For impact, I used three different scales to illustrate how different people might experience the same problem:

To someone working in operations:

  1. Well that’s annoying
  2. This isn’t great but at least it’s manageable
  3. At least things are only broken internally
  4. People are definitely going to notice something is wrong
  5. Everything is on fire!

To our clients:

  1. It’s ok if it doesn’t work, we’re not using it
  2. It works for pretty much everything except…
  3. I guess it’ll do but let’s make it better
  4. This doesn’t really do what I wanted
  5. This isn’t at all what I asked for

And to us, the developers:

  1. Let’s call this a “nice-to-have” and fix it when there’s time
  2. We’ll put this on the roadmap
  3. We’ll bump whatever was next and put it in the next sprint
  4. We need to get someone on this right away
  5. We need to put everything on this right now

You could probably also frame these as performance impact, functional impact, and project impact. Later iterations adjusted the scales a bit and put in more concrete examples; anything that resulted in lost data for a client, for example, was likely to fall into the maximum impact bucket.

Interestingly, in a recent talk Angie Jones extended the basic idea of a 5×5 to include a bunch of other qualities as a way of deciding whether a test is worth automating. In her scheme, she uses “how quickly would this be fixed” as one dimension of the value of a test, whereas I’m folding that into the impact on the development team. I hadn’t seen other variations of the 5×5 matrix when coming up with these scales, and to me the most direct way of making a developer feel the impact of a bug was to ask whether they’d have to work overtime to fix it.

Probability was difficult in its own way as well. We eventually adopted a scale with each bucket mapping to a ballpark percentage chance of a bug being noticed, but even a qualitative scale from “rare” through to “certain” misses a lot of nuance. How do you compare something that will certainly be noticed by only one client to something that has a low chance of manifesting for every client? I can’t say we ever solidified a good solution to this, but we got used to whatever our de facto scale was.
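For illustration, a bucket-to-ballpark mapping might look like the sketch below. The percentages are invented for the example; I honestly don’t remember the exact numbers we settled on.

```python
# Illustrative only: the real thresholds we used are long forgotten.
LIKELIHOOD_BUCKETS = {
    1: "rare      (well under a 5% chance anyone ever hits it)",
    2: "unlikely  (maybe a 5-25% chance)",
    3: "possible  (roughly a coin flip)",
    4: "likely    (more often than not)",
    5: "certain   (someone will hit it, probably on day one)",
}
```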

How testing factors in

We discussed the ratings we wanted to give each ticket on impact and probability of problems at the beginning of each sprint. These discussions would surface all kinds of potential bugs, known troublesome areas, unanswered questions, and ideas about what kind of testing needed to be done.

Inevitably, when somebody explained their reasoning for assigning a higher impact than someone else by raising a potential defect, someone else would say “oh, but that’s easy to test for.” This was great—everybody’s thinking about testing!—but it also created a tendency to downplay the risk. Since a lower-risk item warrants less thorough testing, we might never plan the testing needed to justify calling it low risk in the first place. Because of that, we added a caveat to our estimates: we estimated what the risk would be if we did no testing beyond, effectively, turning the thing on.

With that in mind, a risk of 1 could mean that one quick manual test would be enough to send it out the door. The rare time something was rated as high as 20 or 25, I would have a litany of reasons sourced from the team as to why we were nervous about it and what we needed to do to mitigate that. That number assigned to “risk” at the end of the day became a useful barometer for whether the amount of testing we planned to do was reasonable.

Beyond testing

Doing this kind of risk assessment had positive effects outside of calibrating our testing. The more integrated testing and development became, the clearer it was that management couldn’t just blame testing for long timelines on some of these features. I deliberately worked this into how I wanted the risk scale to be interpreted, so that it spoke to both design and testing:

Risk   Interpretation
1-4    Minimal: Can always improve later, just test the basics.
5-10   Moderate: Use a solution that works over in-depth studies, test realistic edge cases, and keep estimates lean.
12-16  Serious: Careful design, detailed testing on edges and corners, and detailed estimates on any extra testing beyond the norm.
20-25  Critical: In-depth studies, specialized testing, and conservative estimates.
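In code form, the lookup is nothing more than a couple of thresholds. A quick sketch (the function name is mine, and the band wording is paraphrased from the table above):

```python
# Thresholds match the interpretation table above; in-between scores like 11
# or 17 can't actually occur as products of two numbers from 1 to 5.
def risk_band(score: int) -> str:
    if score <= 4:
        return "Minimal: improve it later, test the basics"
    if score <= 10:
        return "Moderate: pragmatic design, realistic edge cases, lean estimates"
    if score <= 16:
        return "Serious: careful design, detailed edge-case testing, detailed estimates"
    return "Critical: in-depth studies, specialized testing, conservative estimates"

print(risk_band(4 * 3))  # impact 4, likelihood 3 -> "Serious: ..."
```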

These boundaries are always fuzzy, of course, and this whole thing has to be evaluated in context. Going back to Angie Jones’s talk, she uses four of these 5×5 grids to get a score out of 100 for whether a test should be automated, and the full range from 25-75 only answers that question with “possibly”. I really like how she uses this kind of system as a comparison against her “gut check”, and my team used this in much the same way.
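The shape of her scheme, as I understood it, is four sub-scores (each out of 25) summed to a total out of 100. A rough sketch follows, with placeholder dimension names rather than the ones she actually uses, and with the firm answers outside the 25-75 band being my own reading:

```python
# Rough sketch of a "four 5x5 grids" scorecard; dimension names are placeholders.
def should_automate(sub_scores):
    total = sum(sub_scores.values())  # each value is itself a 1-25 product
    if 25 <= total <= 75:
        return f"{total}/100: possibly -- compare against your gut check"
    return f"{total}/100: " + ("automate it" if total > 75 else "probably skip it")

print(should_automate({
    "value_of_the_test": 20,
    "how_quickly_a_failure_gets_fixed": 10,
    "history_of_breakage": 15,
    "ease_of_automation": 12,
}))
```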

The end result

Although I did all kinds of fun stuff comparing these risk estimates against the story points we put on them, the total time spent on the ticket, and the ratio of time spent on testing versus development, none of that ever saw practical use beyond “hmmm, that’s kind of interesting” or “yeah, that ticket went nuts”. Even though I adopted this tool as a way of responding to pressure from management to justify timelines, they (thankfully) rarely ended up asking for these metrics either. Once a ticket was done and out the door, we rarely cared about what our original risk estimate was.

On the other side, however, I credit these (sometimes long) conversations with how smoothly the rest of our sprints would go; everybody not only had a good understanding of what exactly needed to be done and why, but we arrived at that understanding as a group. We quantified risk to put a number into a tracking system, but the qualitative understanding of what that number meant is where the value lay.