Seven things I learned at CAST

My CAST debrief from last week ended up being mostly about personal reflection, but I also wanted to call out some of the things I picked up that were more specific to testing. After going through my notes I picked out what I thought were the important themes or interesting insights. There are many other tidbits, interactions, and ideas from those notes that I may write about later; these are just the top seven that I think will have an impact right away.

Each section here is paraphrased from the notes I took in each speaker’s session.

1. “Because we don’t like uncertainty we pretend we don’t have it”
Liz Keogh presented a keynote on using Cynefin as a way of embracing uncertainty and chaos in how we work. We are hardwired to see patterns where they don’t exist. “We treat complexity as if it’s predictable.” People experience the highest stress when shocks are unpredictable, so let’s at least prepare for them. To be safe to fail you have to be able to (1) know if it works, (2) amplify it if it does work, (3) know if it fails, (4) dampen it if it does fail, and (5) have a realistic reason for thinking there will be a positive impact in the first place. It’s not about avoiding failure, it’s about being prepared for it. The linked article above is definitely worth the read.

2. Testers are designers
From John Cutler’s keynote: “Design is the rendering of intent”. Anybody who makes decisions that affect that rendering is a designer. What you test, when you test, who you include, your mental models of the customer, and how you talk about what you find all factor in. UX researchers have to know that they aren’t the user, but often the tester has used the product more than anybody else in the group. There’s no such thing as a design solution, but there is a design rationale. There should be a narrative around the choices we make for our products, and testers provide a ton of the information for that narrative. Without that narrative, no design was happening (nor any testing!) because there’s no sense of the intent.

3. Knowledge goes stale faster than tickets on a board do
From Liz Keogh’s lightning talk: If testers are at their WIP limit and devs can’t push any more work to them, your team will actually go faster if you make the devs go read a book rather than take on new work that will just pile up. You lose more productivity in catching up on work someone has already forgotten about than you gain in “getting ahead”. In practice, of course, you should have the devs help the testers when they’re full and the testers help the devs when they have spare capacity. (By the way, it’s not a Kanban board unless you have a signal for when you can take on new work, like an empty slot in a column. That empty slot is the kanban.)
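The "empty slot is the signal" idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `Column` class, its method names, and the ticket names are hypothetical, not taken from the talk or from any real board tool.

```python
class Column:
    """A board column with a work-in-progress (WIP) limit."""

    def __init__(self, name, wip_limit):
        self.name = name
        self.wip_limit = wip_limit
        self.tickets = []

    def empty_slots(self):
        # An empty slot is the signal that new work may be pulled in.
        return self.wip_limit - len(self.tickets)

    def pull(self, ticket):
        """Accept the ticket only if there is an empty slot; report whether it was accepted."""
        if self.empty_slots() <= 0:
            return False  # at the WIP limit: upstream has to wait (or come and help)
        self.tickets.append(ticket)
        return True


testing = Column("Testing", wip_limit=2)
accepted = [testing.pull(t) for t in ("ticket-A", "ticket-B", "ticket-C")]
print(accepted)  # the third pull is refused because the column is full
```

The point of returning `False` rather than queueing the ticket is that the refusal is visible to whoever is upstream, which is what makes it a signal and not just a pile.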

4. “No matter how it looks at first, it’s always a people problem” – Jerry Weinberg
I don’t recall which session this quote first came up in, but it became a recurring theme in many of the sessions and discussions, especially around coaching testing, communicating, and whole-team testing. Jerry passed away just before the conference started, and though I had never met him he clearly had a strong influence on many of the people there. He’s probably most often cited for his definition that “quality is value to some person”.

5. Pair testing is one of the most powerful techniques we have
Though I don’t think anybody said this explicitly, this was evident to me from how often the concept came up. Lisi Hocke gave a talk about pairing with testers from other companies to improve her own testing while cross-pollinating testing ideas and skills with the wider community. Amit Wertheimer cited pairing with devs as a great way to identify tools and opportunities to make their lives easier. Jose Lima talked about running group exploratory testing sessions and the benefits that brings in learning about the product and coaching testing. Of course coaching itself, I think, is a form of pairing so the tutorial with Anne-Marie Charrett contributed to this theme as well.  This is something that I need to do more of.

6. BDD is not about testing, it’s about collaboration
From Liz Keogh again: “BDD is an analysis technique, it’s not about testing.” It’s very hard to refactor English, and non-technical people rarely read the Cucumber scenarios anyway. She says to just use a DSL instead. If you’re just implementing Cucumber tests but aren’t having a 3-amigos style conversation with the business about what the scenarios should be, it isn’t BDD. Angie Jones emphasized these same points when introducing Cucumber in her automation tutorial, as a caveat that she was only covering the automation part of BDD in the tutorial, not BDD itself. Though I’ve worked in styles that called themselves “behaviour driven”, I’ve never worked with actual “BDD”, and this was the first time I’ve heard of it being more than a way of automating test cases.

7. Want to come up with good test ideas? Don’t read the f*ing manual!
From Paul Holland: Detailed specifications kill creativity. Start with high-level bullet points and brainstorm from there. Share ideas round-robin (think of playing “yes, and”) to build a shared list. Even dumb ideas can trigger good ideas. Encourage even nonsensical ideas, since users will do things you didn’t think of. Give yourself time and space away to allow yourself to be creative, and only after you’ve come up with every test idea you can should you start looking at the details. John Cleese is a big inspiration here.

Bonus fact: Anybody can stop a rocket launch
The bright light of a rocket engine taking off in the night
The “T-minus” countdown for rocket launches isn’t continuous; they pause it at various intervals and reset it back to checkpoints when they have to address something. What does this have to do with testing? Just that the launch of the Parker Solar Probe was the weekend after CAST. At 3:30am on Saturday I sat on the beach with my husband listening to the live feed from NASA as the engineers performed the pre-launch checks: starting, stopping, and resetting the clock as needed. I was struck by the fact that at “T minus 1 minute 55 seconds” one calm comment from one person about one threshold being crossed scrubbed the entire launch without any debate. There wouldn’t be time to rewind to the previous checkpoint at T-minus 4 minutes before the launch window closed, so they shut the whole thing down. I’m sure that there’s an analogy to the whole team owning gates in their CD pipelines in there somewhere!

CAST 2018 Debrief

Last week I was lucky enough to attend the Conference of the Association for Software Testing, CAST 2018. I had been to academic conferences with collaborators before, and a local STAR conference here in Toronto, but this was my first time travelling for a professional conference in testing. The actual experience ended up being quite trying, and I ended up learning as much about myself as about testing. I don’t feel the need to detail my whole experience here, but I will highlight the top 5 lessons I took away from it.

1. “Coaching” is not what I hoped it was

I’ve been hearing a lot about “coaching” as a role for testers lately. I went to both Anne-Marie Charrett’s tutorial and Jose Lima’s talk on the subject thinking that it was a path I wanted to pursue. I went in thinking about using it as a tool to change minds, instill some of my passion for testing into the people I work with, and build up a culture of quality. I came away with a sense of coaching as more of a discussion method, a passive enterprise available for those who want to engage in it and useless for the uninterested. I suspect those who work as coaches would disagree, but that was nonetheless my impression.

One theme that came up from a few people, not just the speakers, was a distinction between coaching and teaching. This isn’t something I really understand, and is likely part of why I was expecting something else from the subject. I taught university tutorials for several years and put a lot of effort into designing engaging classes. To me, what I saw described as coaching felt like a subset of teaching, a particular style of pedagogy, not something that stands in contrast to it. Do people still hear “teaching” and think “lecturing”? I heard “coaching testing” and expected a broader mandate of education and public outreach that I associate with “teaching”.

Specifically, I was looking for insight on breaking through to people who don’t like testing, and who don’t want to learn about it, but very quickly saw that “coaching” wasn’t going to help me with that. At least not at the level at which we got into it within one workshop. I am sure that this is something that would be interesting to hash out in a (meta) coaching session with people like Anne-Marie and Jose, even James Bach and Michael Bolton: i.e. people who have much more knowledge about how coaching can be used than I do.

2. I’m more “advanced” than I thought

My second day at the conference was spent in a class billed as “Advanced Automation” with Angie Jones (@techgirl1908). I chose this tutorial over other equally enticing options because it looked like the best opportunity for something technically oriented, and would produce a tangible artefact — an advanced automated test suite — that I could show off at home and assimilate aspects of into my own automation work.

Angie did a great job of walking us through implementing the framework and justifying the thought process each step of the way. It was a great exercise for me to go through implementing a Java test suite from scratch, including a proper Page Object Model architecture and a TDD approach. It was my first time using Cucumber in Java, and I quite enjoyed the commentary on hiring API testers as we implemented a test with Rest-Assured.

Though I did leave with that tangible working automation artefact at the end of the day, I did find a reverse-Pareto principle at play, with 80% of the value coming from the last 20% of the time. This is what led to my takeaway that I might be more advanced than I had thought. I still don’t consider myself an expert programmer, but I think I could have gotten a lot further had we started with a basic test case already implemented. Interestingly, Angie’s own description for another workshop of hers says “It’s almost impossible to find examples online that go beyond illustrating how to automate a basic login page,” though that’s the example we spent roughly half the day on. Perhaps we’ve conflated “advanced” with “well designed”.

3. The grass is sometimes greener

In any conference, talks will vary both in quality generally and in how much they resonate with any attendee specifically. I was thrilled by John Cutler’s keynote address on Thursday (he struck many chords about the connection between UX and testing that align very closely with my own work), but meanwhile Amit Wertheimer just wrote that he “didn’t connect at all” to it. I wasn’t challenged by Angie’s advanced automation class but certainly others in the room were. This is how it goes.

In a multi-track conference, there’s an added layer: there are other rooms you could be in that you might get more value from. At one point, I found myself getting dragged down by a feeling that I was missing out on better sessions on the other side of the wall. Even though there were plenty of sessions where I know I was in the best room for myself, the chatter on Twitter and the conference Slack workspace sometimes painted a picture of very green grass elsewhere. Going back to Amit’s post, he called Marianne Duijst’s talk about Narratology and Harry Potter one of the highlights of the whole conference, and I’ve seen a few others echo the same sentiment on Twitter. I had it highlighted on my schedule from day one but at the last minute was enticed by the lightning talks session. I got pages of notes from those talks, but I can’t help but wonder what I missed. Social media FOMO is real and it takes a lot of mental energy to break out of that negative mental cycle.

Luckily, the flip side of that kind of FOMO is that asking about a session someone else was in, or gave themselves, is a great conversation starter during the coffee breaks.

4. Networking is the worst

For other conferences I’ve been to, I had the benefit either of going with a group of collaborators I already knew or of being a local, so I could go home at 5 and not worry about dinner plans. Not true when flying alone across the continent. I’ve always been an introvert at the best of times, and I had a hard time breaking out of that to go “network”.

I was relieved when I came across Lisa Crispin writing about how she similarly struggled when she first went to conferences, although that might have helped me more last week than today. Though I’m sure it was in my imagination just as much as it was in hers at her first conference, I definitely felt the presence of “cliques” that made it hard to break in. Ironically, those that go to conferences regularly are less likely to see this happening, since those are the people that already know each other. Speakers and organizers even less so.

It did get much easier once we moved to multiple shorter sessions in the day (lots of coffee breaks) and an organized reception on Wednesday. I might have liked an organized meet-and-greet on the first day, or even the night before the first tutorial, where an introvert like me can lean a bit more on the social safety net of mandated mingling. Sounds fun when I put it like that, right?

I eventually got comfortable enough to start talking with people and go out on a limb here or there. I introduced myself to all the people I aimed to and asked all the questions I wanted to ask… eventually. But there were also a lot of opportunities that I could have taken better advantage of. At my next conference, this is something I can do better for myself, though it also gives me a bit more sensitivity about what inclusion means.

5. I’m ready to start preparing my own talk

Despite my introverted tendencies I’ve always enjoyed teaching, presenting demos, and giving talks. I’ve had some ideas percolating in the back of my mind about what I can bring to the testing community and my experiences this week — in fact every one of the four points above — have confirmed for me that speaking at a conference is a good goal for myself, and that I do have some value to add to the conversation. I have some work to do.

Bonus lessons: Pronouncing “Cynefin” and that funny little squiggle

Among the speakers, as far as notes-written-per-sentence-spoken goes, Liz Keogh was a pretty clear winner by virtue of a stellar lightning talk. Her keynote and the conversation we had afterward, however, are where I picked up these bonus lessons. I had heard of Cynefin before but always had two questions that never seemed to be answered in the descriptions I had read, until this week:

A figure showing the four domains of Cynefin

  1. It’s pronounced like “Kevin” but with an extra “N”
  2. The little hook or squiggle at the bottom of the Cynefin figure you see everywhere is actually meaningful: like a fold in some fabric, it indicates a change in height from the obvious/simple domain in the lower right from which you can fall into the chaotic in the lower left.

Qualifying quantitative risk

Let’s start with quantifying qualitative risk first.

Ages ago I was under pressure from some upper management to justify timelines, and I found a lot of advice about using risk as a tool not only to help managers see what they’re getting from the time spent developing a feature (i.e., less risk) but also to help focus what testing you’re doing. This was also coming hand in hand with a push to loosen up our very well defined test process, which came out of very similar pressure. I introduced the concept of a risk assessment matrix as a way of quantifying risk, and it turned out to be a vital tool for the team in planning our sprints.

Five by five

I can’t find the original reference I based my version on, because if you simply google “risk assessment matrix” you’ll find dozens of links describing the basic concept. The basic concept is this:

Rate the impact (or consequence) of something going wrong on a scale of 1 to 5, with 1 being effectively unnoticeable and 5 being catastrophic. Rate the likelihood (or probability) of something bad happening from 1 to 5, with 1 being very unlikely and 5 being almost certain. Multiply those together and you get a number that represents how risky it is on a scale from 1 to 25.

a 5x5 multiplication table, with low numbers labelled minimal risk and the highest numbers labelled critical risk
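A minimal sketch of that arithmetic in Python, assuming both ratings are whole numbers from 1 to 5. The function name `risk_score` and the row/column orientation of the printed grid are made up for illustration.

```python
def risk_score(impact, likelihood):
    """Multiply a 1-5 impact rating by a 1-5 likelihood rating."""
    for name, value in (("impact", impact), ("likelihood", likelihood)):
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
    return impact * likelihood


# Print the 5x5 grid: one row per impact rating, one column per likelihood rating.
for impact in range(1, 6):
    print(" ".join(f"{risk_score(impact, likelihood):2d}" for likelihood in range(1, 6)))

# Collect the distinct scores the grid can actually produce.
possible_scores = sorted({risk_score(i, l) for i in range(1, 6) for l in range(1, 6)})
```

One thing the grid makes visible: although the scale runs from 1 to 25, only 14 distinct scores can occur, so numbers like 7, 11, or 18 never show up.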

How many ambiguities and room for hand waving can you spot already?

Risk is not objective

One of the biggest problems with a system like this is that there’s a lot of room for interpreting what these scales mean. The numbers 1 to 5 are completely arbitrary so we have to attach some meaning to them. Even the Wikipedia article on risk matrices eschews numbers entirely, using instead qualitative markers laid out in a 5×5 look-up table.

The hardest part of this for me and the team was dealing with the fact that neither impact nor probability is the same for everybody. For impact, I used three different scales to illustrate how different people might react:

To someone working in operations:

  1. Well that’s annoying
  2. This isn’t great but at least it’s manageable
  3. At least things are only broken internally
  4. People are definitely going to notice something is wrong
  5. Everything is on fire!

To our clients:

  1. It’s ok if it doesn’t work, we’re not using it
  2. It works for pretty much everything except…
  3. I guess it’ll do but let’s make it better
  4. This doesn’t really do what I wanted
  5. This isn’t at all what I asked for

And to us, the developers:

  1. Let’s call this a “nice-to-have” and fix it when there’s time
  2. We’ll put this on the roadmap
  3. We’ll bump whatever was next and put it in the next sprint
  4. We need to get someone on this right away
  5. We need to put everything on this right now

You could probably also frame these as performance impact, functional impact, and project impact. Later iterations adjusted the scales a bit and put in more concrete examples; anything that resulted in lost data for a client, for example, was likely to fall into the maximum impact bucket.

Interestingly, in a recent talk Angie Jones extended the basic idea of a 5×5 to include a bunch of other qualities as a way of deciding whether a test is worth automating. In her scheme, she uses “how quickly would this be fixed” as one dimension of the value of a test, whereas I’m folding that into the impact on the development team. I hadn’t seen other variations of the 5×5 matrix when coming up with these scales, and to me the most direct way of making a developer feel the impact of a bug was to ask whether they’d have to work overtime to fix it.

Probability was difficult in its own way as well. We eventually adopted a scale with each bucket mapping to a ballpark percentage chance of a bug being noticed, but even a qualitative scale from “rare” through to “certain” misses a lot of nuance. How do you compare something that will certainly be noticed by only one client to something that has a low chance of manifesting for every client? I can’t say we ever solidified a good solution to this, but we got used to whatever our de-facto scale was.

How testing factors in

We discussed the ratings we wanted to give each ticket on impact and probability of problems at the beginning of each sprint. These discussions would surface all kinds of potential bugs, known troublesome areas, unanswered questions, and ideas of what kind of testing needed to be done.

Inevitably, when somebody explained their reasoning for assigning a higher impact than someone else by raising a potential defect, someone else would say “oh, but that’s easy to test for.” This was great—everybody’s thinking about testing!—but it also created a tendency to downplay the risk. Since a lower risk item should do with less thorough testing, we might not plan to do the testing required to justify the low risk. Because of that, we added a caveat to our estimates: we estimated what the risk would be if we did no testing beyond, effectively, turning the thing on.

With that in mind, a risk of 1 could mean that one quick manual test would be enough to send it out the door. The rare time something was rated as high as 20 or 25, I would have a litany of reasons sourced from the team as to why we were nervous about it and what we needed to do to mitigate that. That number assigned to “risk” at the end of the day became a useful barometer for whether the amount of testing we planned to do was reasonable.

Beyond testing

Doing this kind of risk assessment had positive effects outside of calibrating our testing. The more integrated testing and development became, the more clear it was that management couldn’t just blame testing for long timelines on some of these features. I deliberately worked this into how I wanted the risk scale to be interpreted, so that it spoke to both design and testing:

Risk    Interpretation
1-4     Minimal: Can always improve later, just test the basics.
5-10    Moderate: Use a solution that works over in-depth studies, test realistic edge cases, and keep estimates lean.
12-16   Serious: Careful design, detailed testing on edges and corners, and detailed estimates on any extra testing beyond the norm.
20-25   Critical: In-depth studies, specialized testing, and conservative estimates.
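The table above amounts to a small lookup, sketched here in Python. The band limits and labels are copied from the table; the names `RISK_BANDS` and `interpret_risk` are hypothetical, and rejecting scores that fall between bands is my own assumption about how to treat numbers the 5×5 grid cannot produce.

```python
# Bands copied from the table: (lowest score, highest score, label).
RISK_BANDS = [
    (1, 4, "Minimal"),
    (5, 10, "Moderate"),
    (12, 16, "Serious"),
    (20, 25, "Critical"),
]


def interpret_risk(score):
    """Return the band label for a risk score (impact times likelihood)."""
    for low, high, label in RISK_BANDS:
        if low <= score <= high:
            return label
    # Values such as 11 or 17-19 are not products of two 1-5 ratings,
    # so treating them as input errors is an assumption of this sketch.
    raise ValueError(f"{score} is not a score the 5x5 matrix can produce")


print(interpret_risk(3 * 4))  # impact 3, likelihood 4
```

The gaps between bands (11, 17 to 19) are not an oversight: no pair of 1-5 ratings multiplies to those values.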

These boundaries are always fuzzy, of course, and this whole thing has to be evaluated in context. Going back to Angie Jones’s talk, she uses four of these 5×5 grids to get a score out of 100 for whether a test should be automated, and the full range from 25-75 only answers that question with “possibly”. I really like how she uses this kind of system as a comparison against her “gut check”, and my team used this in much the same way.
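The arithmetic behind a score out of 100 can be sketched roughly as follows. This rests on several assumptions that are mine, not confirmed details of the talk: that each grid is the product of two 1-5 ratings, that the four products are simply added, and that scores above 75 and below 25 map to a clear yes and a clear no. Only the “possibly” band of 25-75 comes from the description above; the function name and verdict strings are hypothetical.

```python
def automation_verdict(grids):
    """Score a test from four (rating, rating) pairs, each rating from 1 to 5.

    Returns (total, verdict), where total is the sum of the four products.
    """
    if len(grids) != 4:
        raise ValueError("expected exactly four grids")
    for a, b in grids:
        if not (1 <= a <= 5 and 1 <= b <= 5):
            raise ValueError(f"ratings must be from 1 to 5, got {(a, b)!r}")

    total = sum(a * b for a, b in grids)  # ranges from 4 to 100

    if total > 75:
        verdict = "automate"   # assumed meaning of the top band
    elif total < 25:
        verdict = "skip"       # assumed meaning of the bottom band
    else:
        verdict = "possibly"   # the 25-75 band described in the text
    return total, verdict


print(automation_verdict([(3, 3), (4, 2), (2, 5), (3, 4)]))
```

With half of the range answering “possibly”, the number works best the way it is described above: as something to hold up against a gut check, not as a decision on its own.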

The end result

Although I did all kinds of fun stuff with comparing these risk estimates against the story points we put on them, the total time spent on the ticket, and whether we were spending a reasonable ratio of time on test to development, none of that ever saw practical use beyond “hmmm, that’s kind of interesting” or “yeah that ticket went nuts”. Even though I adopted this tool as a way of responding to pressure from management to justify timelines, they (thankfully) rarely ended up asking for these metrics either. Once a ticket was done and out the door, we rarely cared about what our original risk estimate was.

On the other side, however, I credit these (sometimes long) conversations with how smoothly the rest of our sprints would go; everybody not only had a good understanding of what exactly needed to be done and why, but we arrived at that understanding as a group. We quantified risk to put a number into a tracking system, but the qualitative understanding of what that number meant is where the value lay.