Rethinking velocity

I’ve been thinking about the concept of “velocity” in software development over the last few days, in response to a couple of cases recently where I’ve seen people express dislike for it. I can understand why, and the more I think about it the more I think that the concept does more harm than good.

When I was first introduced to the concept it was with the analogy of running a race. To win, a runner wants to increase their velocity, running the same distance in a shorter amount of time. Even though the distance is the same each time they run the race, with practice the runner can complete it faster and faster.

The distance run, in the software development analogy, is the feature completed. In Scrum, velocity is a measure of how many story points a team completes in a sprint. Individual pieces of work are sized by their “complexity”, so with practice, a team should be able to increase their velocity by finishing work of a given complexity in less time. I have trouble with this first because story points are problematic at best, so any velocity you try to calculate will be easily manipulated. Since I’ve gotten into trouble with the Scrum word police before, I’m going to put that aside for a moment and say that the units you use don’t matter for what I’m talking about.

It should be fair to say that increasing velocity as Scrum defines it is about being able to do more complex work within a sprint without actually doing more work (more time, more effort), because the team gets better at doing that work. (This works for either a larger amount of work of a fixed complexity, or a sprint’s worth of work that is more complex than could have been done in previous sprints.) Without worrying about some nebulous “points”, the concept is still about being able to do more than you could before in a fixed amount of time.

But that’s not what people actually hear when you say we need to “increase velocity”.

Rather, it feels like being asked to do the work faster and faster. Put the feature factory at full steam! You need to work faster, you need to get more done, you need to be able to finish any feature in less than two weeks. Asking how you can increase velocity doesn’t ask “how can we make this work easier?” It asks, “why is this taking so long?” It feels like a judgement, and so we react negatively to it.

While it certainly does make sense to try to make repeated work easier with each iteration, I don’t think that should be the goal of a team. The point of being agile as I’ve come to understand it (and I’ll go with a small “a” here again to avoid the language police) is to be flexible to change by encouraging shorter feedback cycles, which itself is only possible by delivering incrementally to customers earlier than if we worked in large process-heavy single-delivery projects.

Building working-ish versions of a widget and delivering incremental improvements more often might take longer to get to the finished widget, but with the constant corrections along the way the end result should actually be better than it otherwise would have been. And, of course, if the earlier iterations can meet some level of the customer’s need, then they start getting value far sooner in the process as well. The complexity of the widget doesn’t change, but I’d be happy to take the lower velocity for a higher quality product.

I’m bringing it back to smaller increments and getting feedback because one of the conversations that led to this thinking was about whether putting our products in front of users sooner was the same as asking for increased velocity. Specifically, I said “If you aren’t putting your designs in front of users, you’re wasting your time.” In a sense, I am asking for something to be faster, and going faster means velocity, so these concepts get conflated.

The “velocity” I care about isn’t the number of points done in a given time frame (or number of stories, or number of any kind of deliverable.) What I care about is, how many feedback points did I get in that time? How many times did we check our assumptions? How sure are we that the work we’ve been doing is actually going to provide the value needed? Maybe “feedback frequency” is what we should be talking about.

A straight line from start to finish for a completed widget with feedback at the start and end, vs a looping line with seven points of feedback but longer to get to the end.
And this is generously assuming you have a good idea of what needs to be built in the first place.

Importantly, I’m not necessarily talking about getting feedback on a completed (or prototype) feature delivered to production. Much like I argued that you can demo things that aren’t done, there is information to be gained at every stage of development, from initial idea through design to prototypes and final polish. I’ve always been an information junkie, so I see value in any information about the outside world, be it anecdotal oddities or huge statistical models of behaviour tracked in your app. Even just making observations about the world, learning about your intended users’ needs before you know what to offer them, all feeds into this category. Too often this happens only once at the outset, and a second time when all is said and done if you’re lucky. I’m not well versed in the design and user experience side of things yet, but I wager that even the big-picture, blue-sky exploration we might want to do can still be checked against the outside world more often than most people think.

Much like “agile” and “automation”, the word “velocity” itself has become a distraction. People associate it with the sense of wanting to do the same thing faster and faster. What I actually want is to do things in smaller chunks, more often. Higher frequency adjustments to remain agile and build better products, not just rushing for the finish line.

Highlights from one day at Web Unleashed 2018

I pretended to be a Front-End Developer for the day on Monday and attended some sessions at FITC’s Web Unleashed conference. Here are some of the things I found interesting, paraphrasing my notes from each speaker’s talk.

Responsive Design – Beyond Our Devices

Ethan Marcotte talked about the shift from pages as the central element in web design to patterns. Beware the trap of “as goes the mockup, so goes the markup”. This means designing the priority of content, not just the layout. From there it’s easier to take a device-agnostic approach, where you start with the same baseline layout that works across all devices and enhance it from there based on the features supported. He wrapped up with a discussion of the value that having a style guide brings, and pointed to a roundup of tools for creating one by Susan Robertson, highlighting that centering on a common design language and naming our patterns helps us understand the intent and use them consistently.

I liked the example of teams struggling with the ambiguity in Atomic Design’s molecules and organisms because I had the same problem the first time I saw it.

Think Like a Hacker – Avoiding Common Web Vulnerabilities

Kristina Balaam reviewed a few common web scripting vulnerabilities. The slides from her talk have demos of each attack on a dummy site. Having worked mostly in the back-end until relatively recently, cross-site scripting is still something I’m learning a lot about, so despite these being quite basic, I admit I spent a good portion of this talk thinking “I wonder if any of my stuff is vulnerable to this.” She pointed to OWASP as a great resource, especially their security audit checklists and code review guidelines. Their site is a bit of a mess to navigate but it’s definitely going into my library.

As a sidenote, her slides say security should “be paramount to QA”, though verbally she said “as paramount as QA”. Either way, this got me thinking about how it fits into customer-first testing, given that it’s often something that’s invisible to the user (until it’s painfully not). There may be a useful distinction there between modern testing as advocating for the user vs having all priorities set by the user, strongly dependent on the nature of that relationship.

Inclusive Form Inputs

Andréa Crofts gave several interesting examples (along with some do’s and don’ts) of how to make forms inclusive. The theme was generally to think of spectra, rather than binaries. Gender is the familiar one here, but something new to me was that you should offer the ability to select multiple options to allow a richer expression of gender identity. Certainly avoid “other” and anything else that creates one path for some people and another for everybody else. She pointed to an article on designing forms for gender by Sabrina Foncesca as a good reference. Also interesting was citizenship (there are many different legal statuses than just “citizen” and “non-citizen”) and the cultural assumptions that are built into the common default security questions. Most importantly: explain why you need the data at all, and talk to people about how to best ask for it. There are more resources on inclusive forms on her website.

Our Human Experience

Haris Mahmood had a bunch of great examples of how our biases creep into our work as developers. Google Translate, for one, treats the Turkish gender neutral pronoun “o” differently when used with historically male or female dominated jobs, just as a result of the learning algorithms being trained on historical texts. The failure of software to recognize people with dark skin was another poignant example. My takeaway: bias in means bias out.

My favourite question of the day came from a woman in the back: “how do you get white tech bros to give a shit?”

Prototyping for Speed & Scale

Finally, Carl Sziebert ran through a ton of design prototyping concepts. Emphasizing the range of possible fidelity in prototypes really helped to show how many different options there are to get fast feedback on our product ideas. Everything from low-fi paper sketches to high-fi user stories implemented in code (to be evaluated by potential users) can help us learn something. The Skeptic’s Guide to Low Fidelity Prototyping by Laura Busche might help convince people to try it, and Taylor Palmer’s list of every UX prototyping tool ever is sure to have anything you need for the fancier stages. (I’m particularly interested to take a closer look at Framer X for my React projects.)

He also talked about prototypes as a way to choose between technology stacks, as a compromise and collaboration tool, a way of identifying “cognitive friction” (repeated clicks and long time between actions to show that something isn’t behaving the way the user expects, for example), and a way of centering design around humans. All aspects that I want to start embracing. His slides have a lot of great visuals to go with these concepts.

Part of the fun of being at a front-end and design-focused conference was seeing how many common themes there are with the conversation happening in the testing space. Carl mentioned the “3-legged stool” metaphor that they use at Google—an engineer, a UX designer, and a PM—that is a clear cousin (at least in spirit if not by heritage) of the classic “3 amigos”—a business person, developer, and tester.

This will all be good fodder for when I lead a UX round-table at the end of the month. You’d be forgiven for forgetting that I’m actually a tester.

Seven things I learned at CAST

My CAST debrief from last week ended up being mostly about personal reflection, but I also wanted to call out some of the things I picked up that are more specific to testing. After going through my notes I picked out what I thought were the important themes or interesting insights. There are many other tidbits, interactions, and ideas from those notes that I may write about later; these are just the top seven that I think will have an impact right away.

Each section here is paraphrased from the notes I took in each speaker’s session.

1. “Because we don’t like uncertainty we pretend we don’t have it”
Liz Keogh presented a keynote on using Cynefin as a way of embracing uncertainty and chaos in how we work. We are hard-wired to see patterns where they don’t exist. “We treat complexity as if it’s predictable.” People experience the highest stress when shocks are unpredictable, so let’s at least prepare for them. To be safe to fail you have to be able to (1) know if it works, (2) amplify it if it does work, (3) know if it fails, (4) dampen it if it does fail, and (5) have a realistic reason for thinking there will be a positive impact in the first place. It’s not about avoiding failure, it’s about being prepared for it. The linked article above is definitely worth the read.

2. Testers are designers
From John Cutler’s keynote: “Design is the rendering of intent”. Anybody who makes decisions that affect that rendering is a designer. What you test, when you test, who you include, your mental models of the customer, and how we talk about what we find all factor in. UX researchers have to know that they aren’t the user, but often the tester has used the product more than anybody else in the group. There’s no such thing as a design solution, but there is a design rationale. There should be a narrative around the choices we make for our products, and testers provide a ton of the information for that narrative. Without that narrative, no design was happening (nor any testing!) because there’s no sense of the intent.

3. Knowledge goes stale faster than tickets on a board do
From Liz Keogh’s lightning talk: If testers are at their WIP limit and devs can’t push any more work to them, your team will actually go faster if you make the devs go read a book rather than take on new work that will just pile up. You lose more productivity in catching up on work someone has already forgotten about than you gain in “getting ahead”. In practice, of course, you should have the devs help the testers when they’re full and the testers help the devs when they have spare capacity. (By the way, it’s not a Kanban board unless you have a signal for when you can take on new work, like an empty slot in a column. That empty slot is the kanban.)

4. “No matter how it looks at first, it’s always a people problem” – Jerry Weinberg
I don’t recall which session this quote first came up in, but it became a recurring theme in many of the sessions and discussions, especially around coaching testing, communicating, and whole-team testing. Jerry passed away just before the conference started, and though I had never met him he clearly had a strong influence on many of the people there. He’s probably most often cited for his definition that “quality is value to some person”.

5. Pair testing is one of the most powerful techniques we have
Though I don’t think anybody said this explicitly, this was evident to me from how often the concept came up. Lisi Hocke gave a talk about pairing with testers from other companies to improve her own testing while cross-pollinating testing ideas and skills with the wider community. Amit Wertheimer cited pairing with devs as a great way to identify tools and opportunities to make their lives easier. Jose Lima talked about running group exploratory testing sessions and the benefits that brings in learning about the product and coaching testing. Of course coaching itself, I think, is a form of pairing so the tutorial with Anne-Marie Charrett contributed to this theme as well.  This is something that I need to do more of.

6. BDD is not about testing, it’s about collaboration
From Liz Keogh again: “BDD is an analysis technique, it’s not about testing.” It’s very hard to refactor English and non-technical people rarely read the cucumber scenarios anyway. She says to just use a DSL instead. If you’re just implementing cucumber tests but aren’t having a 3-amigos style conversation with the business about what the scenarios should be, it isn’t BDD. Angie Jones emphasized these same points when introducing cucumber in her automation tutorial as a caveat that she was only covering the automation part of BDD in the tutorial, not BDD itself. Though I’ve worked in styles that called themselves “behaviour driven”, I’ve never worked with actual “BDD”, and this was the first time I’ve heard of it being more than a way of automating test cases.

7. Want to come up with good test ideas? Don’t read the f*ing manual!
From Paul Holland: Detailed specifications kill creativity. Start with high level bullet points and brainstorm from there. Share ideas round-robin (think of playing “yes, and”) to build a shared list. Even dumb ideas can trigger good ideas. Encourage even nonsensical ideas since users will do things you didn’t think of. Give yourself time and space away to allow yourself to be creative, and only after you’ve come up with every test idea you can should you start looking at the details. John Cleese is a big inspiration here.

Bonus fact: Anybody can stop a rocket launch
The bright light of a rocket engine taking off in the night
The “T-minus” countdown for rocket launches isn’t continuous; they pause it at various intervals and reset it back to checkpoints when they have to address something. What does this have to do with testing? Just that the launch of the Parker Solar Probe was the weekend after CAST. At 3:30am on Saturday I sat on the beach with my husband listening to the live feed from NASA as the engineers performed the pre-launch checks: starting, stopping, and resetting the clock as needed. I was struck by the fact that at “T minus 1 minute 55 seconds” one calm comment from one person about one threshold being crossed scrubbed the entire launch without any debate. There wouldn’t be time to rewind to the previous checkpoint at T-minus 4 minutes before the launch window closed, so they shut the whole thing down. I’m sure that there’s an analogy to the whole team owning gates in their CD pipelines in there somewhere!

Qualifying quantitative risk

Let’s start with quantifying qualitative risk first.

Ages ago I was under pressure from some upper management to justify timelines, and I found a lot of advice about using risk as a tool not only to help managers see what they’re getting from the time spent developing a feature (i.e., less risk) but also to help focus what testing you’re doing. This came hand in hand with a push to loosen up our very well defined test process, which grew out of very similar pressure. I introduced the concept of a risk assessment matrix as a way of quantifying risk, and it turned out to be a vital tool for the team in planning our sprints.

Five by five

I can’t find the original reference I based my version on; if you simply google “risk assessment matrix” you’ll find dozens of links describing the same idea. The basic concept is this:

Rate the impact (or consequence) of something going wrong on a scale of 1 to 5, with 1 being effectively unnoticeable and 5 being catastrophic. Rate the likelihood (or probability) of something bad happening from 1 to 5, with 1 being very unlikely and 5 being almost certain. Multiply those together and you get a number that represents how risky it is on a scale from 1 to 25.

a 5x5 multiplication table, with low numbers labelled minimal risk and the highest numbers labelled critical risk
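The arithmetic is trivial, but spelling it out helps show how coarse the scale is. Here's a minimal sketch (the function name and range check are mine, not part of any standard):

```python
def risk_score(impact: int, likelihood: int) -> int:
    """Multiply a 1-5 impact rating by a 1-5 likelihood rating."""
    if not (1 <= impact <= 5 and 1 <= likelihood <= 5):
        raise ValueError("ratings must be between 1 and 5")
    return impact * likelihood

# A barely noticeable but likely problem scores the same as a
# serious but rare one -- the matrix flattens that distinction.
assert risk_score(impact=1, likelihood=4) == risk_score(impact=4, likelihood=1) == 4
```

Note that only 14 distinct scores are even possible (the products of two integers from 1 to 5), which is part of why the labels attached to the numbers end up mattering more than the numbers themselves.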

How many ambiguities and room for hand waving can you spot already?

Risk is not objective

One of the biggest problems with a system like this is that there’s a lot of room for interpreting what these scales mean. The numbers 1 to 5 are completely arbitrary so we have to attach some meaning to them. Even the Wikipedia article on risk matrices eschews numbers entirely, using instead qualitative markers laid out in a 5×5 look-up table.

The hardest part of this for me and the team was dealing with the fact that neither impact nor probability is the same for everybody. For impact, I used three different scales to illustrate how differently people might react to the same failure:

To someone working in operations:

  1. Well that’s annoying
  2. This isn’t great but at least it’s manageable
  3. At least things are only broken internally
  4. People are definitely going to notice something is wrong
  5. Everything is on fire!

To our clients:

  1. It’s ok if it doesn’t work, we’re not using it
  2. It works for pretty much everything except…
  3. I guess it’ll do but let’s make it better
  4. This doesn’t really do what I wanted
  5. This isn’t at all what I asked for

And to us, the developers:

  1. Let’s call this a “nice-to-have” and fix it when there’s time
  2. We’ll put this on the roadmap
  3. We’ll bump whatever was next and put it in the next sprint
  4. We need to get someone on this right away
  5. We need to put everything on this right now

You could probably also frame these as performance impact, functional impact, and project impact. Later iterations adjusted the scales a bit and put in more concrete examples; anything that resulted in lost data for a client, for example, was likely to fall into the maximum impact bucket.

Interestingly, in a recent talk Angie Jones extended the basic idea of a 5×5 to include a bunch of other qualities as a way of deciding whether a test is worth automating. In her scheme, she uses “how quickly would this be fixed” as one dimension of the value of a test, whereas I’m folding that into the impact on the development team. I hadn’t seen other variations of the 5×5 matrix when coming up with these scales, and to me the most direct way of making a developer feel the impact of a bug was to ask whether they’d have to work overtime to fix it.

Probability was difficult in its own way as well. We eventually adopted a scale with each bucket mapping to a ballpark percentage chance of a bug being noticed, but even a qualitative scale from “rare” through to “certain” misses a lot of nuance. How do you compare something that will certainly be noticed by only one client to something that has a low chance of manifesting for every client? I can’t say we ever solidified a good solution to this, but we got used to whatever our de-facto scale was.

How testing factors in

We discussed the ratings we wanted to give each ticket on impact and probability of problems at the beginning of each sprint. These discussions would surface all kinds of potential bugs, known troublesome areas, unanswered questions, and ideas of what kind of testing needed to be done.

Inevitably, when somebody explained their reasoning for assigning a higher impact than someone else by raising a potential defect, someone else would say “oh, but that’s easy to test for.” This was great—everybody’s thinking about testing!—but it also created a tendency to downplay the risk. Since a lower-risk item can get by with less thorough testing, we might not plan to do the testing required to justify the low risk. Because of that, we added a caveat to our estimates: we estimated what the risk would be if we did no testing beyond, effectively, turning the thing on.

With that in mind, a risk of 1 could mean that one quick manual test would be enough to send it out the door. The rare time something was rated as high as 20 or 25, I would have a litany of reasons sourced from the team as to why we were nervous about it and what we needed to do to mitigate that. That number assigned to “risk” at the end of the day became a useful barometer for whether the amount of testing we planned to do was reasonable.

Beyond testing

Doing this kind of risk assessment had positive effects outside of calibrating our testing. The more integrated testing and development became, the more clear it was that management couldn’t just blame testing for long timelines on some of these features. I deliberately worked this into how I wanted the risk scale to be interpreted, so that it spoke to both design and testing:

Risk   Interpretation
1-4    Minimal: Can always improve later, just test the basics.
5-10   Moderate: Use a solution that works over in-depth studies, test realistic edge cases, and keep estimates lean.
12-16  Serious: Careful design, detailed testing on edges and corners, and detailed estimates on any extra testing beyond the norm.
20-25  Critical: In-depth studies, specialized testing, and conservative estimates.
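To make the bucket boundaries concrete, here's how I'd sketch the lookup in code (the function name is my own; the gaps between buckets are deliberate, since scores like 11 or 17 can never arise as products of two 1-to-5 ratings):

```python
def risk_category(score: int) -> str:
    """Map a 1-25 risk score to an interpretation bucket."""
    if 1 <= score <= 4:
        return "Minimal"
    if 5 <= score <= 10:
        return "Moderate"
    if 12 <= score <= 16:
        return "Serious"
    if 20 <= score <= 25:
        return "Critical"
    # 11, 17, 18, 19, etc. can't be produced by two 1-5 ratings
    raise ValueError(f"impossible risk score: {score}")

# impact 3 with likelihood 5 lands in the Serious bucket
assert risk_category(3 * 5) == "Serious"
```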

These boundaries are always fuzzy, of course, and this whole thing has to be evaluated in context. Going back to Angie Jones’s talk, she uses four of these 5×5 grids to get a score out of 100 for whether a test should be automated, and the full range from 25-75 only answers that question with “possibly”. I really like how she uses this kind of system as a comparison against her “gut check”, and my team used this in much the same way.

The end result

Although I did all kinds of fun stuff with comparing these risk estimates against the story points we put on them, the total time spent on the ticket, and whether we were spending a reasonable ratio of time on test to development, none of that ever saw practical use beyond “hmmm, that’s kind of interesting” or “yeah that ticket went nuts”. Even though I adopted this tool as a way of responding to pressure from management to justify timelines, they (thankfully) rarely ended up asking for these metrics either. Once a ticket was done and out the door, we rarely cared about what our original risk estimate was.

On the other side, however, I credit these (sometimes long) conversations with how smoothly the rest of our sprints would go; everybody not only had a good understanding of what exactly needed to be done and why, but we arrived at that understanding as a group. We quantified risk to put a number into a tracking system, but the qualitative understanding of what that number meant is where the value lay.