In defence of time over story points

I have to admit, there was a time when I was totally on board with estimating work in “story points”. Briefly I was the resident point-apologist around town, explaining metaphors about how points are like the distance of a race that people complete in different times. These days, while estimating complexity has its uses, I’m coming to appreciate those old-fashioned time estimates.

Story points are overrated. Here are a few of the reasons why I think so. Strap yourselves in, this is a bit of a rant. But don’t worry, I’ll hedge at the end.

The scale is arbitrary and unintuitive

How do you measure complexity? What units do you use? Can you count the number of requirements, the acceptance criteria, the number of changes, the smelliness of the code to be changed, the number of test cases required, or the temperature of the room after the developers have debated the best implementation?

To avoid that question, story points use an arbitrary scale with arbitrary increments. It could be the Fibonacci sequence, powers of two, or just numbers 1 through 5. That itself is not necessarily a problem — Fahrenheit and Celsius are both arbitrary scales that measure something objective — but if you ask 10 developers what a “1” means you’ll get zero answers if they haven’t used points yet and 20 answers 6 months later.

I don’t know anybody who has an intuition for estimating “complexity” because there’s no scale for it. There’s nothing to check it against. Meanwhile we’ve all been developing an intuition for time ever since we started asking “are we there yet?” from the back of the car or complaining that it wasn’t late enough for bedtime.

People claim that you can build your scale by taking the simplest task as a “1” and going from there. But complexity doesn’t scale like that. What’s twice as complicated as, say, changing a configuration value? Even if you compare the ticket being estimated with previous ones, you’re never going to place it correctly in an ordered list (even a binned one) of all previous tickets. You’re guaranteed to have some tickets that are more “complex” than others yet rated at fewer points, because you were feeling confident that day or didn’t have a full picture of the work. (Though if you do try this, it can give you the side benefit of questioning whether those old tickets really deserve the points they got.)

It may not be impossible to get a group of people to come to a common intuition around estimating complexity, but it sure takes a lot longer than agreeing on how long a day or a week is. Even if you did reach that common understanding, nobody outside the team will understand it.

Points aren’t what people actually care about

People, be it the business or dependent teams, need to schedule things. If we want to have goals and try to reach them, we have to have some idea of how much we have to do to get there and how much time it will take to do that work. If someone asks “when can we start work on feature B” and you say “well feature A has 16 points”, their next question is “OK, and how long will that take?” or “and when will it be done?” Points don’t answer either question, and nobody is going to be happy if you tell them the question can’t be answered.

In practice (at least in my experience) people use time anyway. “It’ll only take an hour so I’m giving it one point”. “I’d want to spend a week on this so let’s give it 8 points.” When someone says “This is more complicated so we better give it more points” it’s because they’ll need more time to do it!

Maybe I care about complexity because complexity breeds risk and I’ll need to be more careful testing it. That’s fair, and a decent reason for asking the question, but it also just means you need more time to test it. Complexity is certainly one dimension of that, but it isn’t the whole story (the impact and probability of a risk manifesting are others).

Even the whole premise of points, to be able to measure the velocity of a team, admits that time is the important factor. Velocity matters because it tells you how much work you can reasonably put into your sprint. But given a sprint length you already know how many hours you can fit into a sprint. What’s the point of beating around the bush about it?

Points don’t provide feedback

Time has built-in feedback that points can’t match. That’ll take me less than a day, I say. Two days later we have a problem.

Meanwhile I say something is 16 story points. Two days later it isn’t done… do I care? Am I running behind? What about 4 weeks later? Was this really a 16 point story or not? Oh, actually, someone was expecting it last Thursday? That pesky fourth dimension again!

Points don’t avoid uncertainty

I once heard someone claim that one benefit of story points is that they don’t change when something unexpected comes up. In one sense that’s true, but only if there’s no feedback on the actual value of points. Counterexamples are about as easy to find as stories themselves.

Two systems interact with each other in a way the team didn’t realize. Someone depends on the legacy behaviour so you need to add a migration plan. The library that was going to make the implementation a single line has a bug in it. Someone forgot to mention a crucial requirement. There are new stakeholders that need to be looped in. Internet Explorer is a special snowflake. The list goes on, and each new thing can make something more complex. If they don’t add complexity after you’ve assigned a number, what creates the complexity in the first place?

Sure, you try to figure out all aspects of the work as early as possible, maybe even before it gets to the point of estimating for a sprint. Bring in the three amigos! But all the work you do to nail down the “complexity” of a ticket isn’t anything special about “complexity” as a concept; it’s exactly the same kind of work you’d do to refine a time estimate. Neither one has a monopoly on certainty.

Points don’t represent work

One work ticket might require entering configurations for 100 clients for a feature we developed last sprint. It’s dead simple brainless work and there’s minimal risk beyond copy-paste errors that there are protections for anyway. Complexity? Nah, it’s one point, but I’ll spend the whole sprint doing it.

Another work ticket is replacing a legacy piece of code to support an upcoming batch of features. We know the old code tends to be buggy and we’ve been scared to touch it for years because of that. The new version is already designed but it’ll be tricky to plug in and test thoroughly to make sure we don’t break anything in the process. Not a big job—it can still be done in one sprint—but relatively high risk and complex. 20 points.

So wait, if both of those fit in one sprint, why do I care what the complexity is? There are real answers to that, but answering the question of how much work it is isn’t one of them. If you argue that those two examples should have similar complexity since they both take an entire sprint, then you’re already using time as the real estimator and I don’t need to convince you.

Points are easily manipulated

Like any metric, we must question how points can be manipulated and whether there’s incentive to do so.

In order to see an increase in velocity, you have to have a really well understood scale. The only way to calibrate that scale without using a measurable unit is to spend months “getting a feel for it”.

Now if you’re looking for ways to increase your velocity, guaranteed the cheapest way to do that (deliberately or not) is to just start assigning more points to things. Now that the team has been at this for a while, one might say, they can better estimate complexity. Fewer unknowns mean more knowns, which are more things to muddy the discussion and push up those complexity estimates. (Maybe you are estimating more accurately, but how can you actually know that?) Voila. Faster velocity brought to you in whole by the arbitrary, immeasurable, and subjective nature of points.

Let’s say we avoid that trap, and we actually are getting better at the work we’re doing. Something that was really difficult six months ago can be handled pretty quickly now without really thinking about it. Is that ticket still as complex as it was six months ago? If the work hasn’t changed it should be, but it sure won’t feel as complex. So is your instinct going to be to put the same points on it? Velocity stagnates even though you’re getting more done. Not only can velocity be manipulated through malice, it doesn’t even correlate with the thing you want to measure!

It’s a feature, not a bug

One argument I still anticipate in favour of points is that the incomprehensibility of them is actually a feature, not a bug. It’s arbitrary on purpose so that it’s harder for people outside the team to translate them into deadlines to be imposed onto that team. It’s a protection mechanism. A secret code among developers to protect their own sanity.

If that’s the excuse, then you’ve got a product management problem, not an estimation problem.

In fact it’s a difficulty with metrics, communication, and overzealous people generally, not something special about time. The further metrics get from the thing they measure, the more likely they are to be misused. Points, if anybody understood them, would be just as susceptible to that.

A final defence of complexity

As a replacement for estimating work in time, story points are an almost entirely useless concept that introduces more complexity than it estimates. There’s a lot of jumping through hoops and hand waving to make it look like you’re not estimating things in time anymore. I’d much rather deal in a quantity we actually have units for. I’m tempted to say save yourself the effort, except for one thing: trying to describe the complexity of proposed work is a useful tool for fleshing out what the work actually requires and for getting everybody on an equal footing in understanding that work. That part doesn’t go away, though the number you assign to it might as well. Just don’t pretend it’s more meaningful than hours on a clock.

Little Alchemy: A silly little test exercise

Lately I’ve been entertaining myself with a silly little mobile app called Little Alchemy 2, a simple revamp of an earlier concept game. It distracted me enough this week that I didn’t prepare a respectable blog post. While questioning my life choices, I started to convince myself that maybe I had been playing at testing this whole time. Testers, myself included, love to make analogies to science, and this is a game that simulates the precursor to modern chemistry. Of course it can be used as a tortured metaphor for software testing! Follow me on my journey…

Exploration

Screenshot of Little Alchemy

The main concept of the game is to drag icons of real life objects on top of each other, combining them to make new items.

At first there are only a few options: Water, Earth, Fire, Air. The classics. Not hard to just play around with different combinations and see how this game works.

Fire + Water = Steam. Makes sense.

Water + Earth = Mud. Great.

Now I wonder what happens if I combine steam and mud….

After a while you start to get a sense for how this universe works.

Developing heuristics

After a while I start to see patterns. If I combine the same thing with itself I get a bigger version of that thing; can I do that with everything? Fire seems to transform most items. When I find out how to make paper or wood, I’m probably going to try to throw those on the fire too.

Combinations quickly multiply.

Before long I have hundreds of items of all sorts. Too many to keep in my head. I start to think I’m probably missing combinations, or I’ve forgotten what combinations I’ve already tried. I know I tried putting wood into fire, but did I put wood into the campfire? Am I missing some obvious combination that’s right in front of me?

Using an Oracle

That’s when I start googling around for hints and suggestions. This part gets a bit cheaty, but at the time it was what I needed to do to get a handle on the problem and keep making progress. I found a site that would randomly show me an item, and I could use that to see if I had the pieces I needed to make it. No more guessing, I was given the right answer.

I suppose this is where I could veer off into automation, but where’s the fun in that? After a while, I started to exhaust the hint system anyway; with only random items the ratio of new to known items started to decline. ROI went down. My oracle was getting less useful, not telling me anything new.

Brute force

There was still an intractable number of items, and I had seen enough unexpected combinations that I didn’t trust myself to reason them all out. So instead, I turned to brute force.

First item on the list. Try to combine with every item below it. Repeat.

Now I should really think about automation if my goal were just to find all combinations. This is a pure algorithm with finite steps, since the game promises a finite number of items. But things start to change after using this for a few iterations manually. Happily the game removes items that have no undiscovered combinations, so in theory the complexity will peak and the game will start to get simpler again. (Wouldn’t it be nice if software always got simpler with time?) Moreover, a little bit of brute force starts to show patterns that I hadn’t been aware of before. I start to skip ahead: “aha! if that combination works, then I bet this other one will too…” One new strategy begets another!
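
If I ever did hand that loop to a machine, it would look something like this minimal sketch. The combine function and the toy recipe book here are stand-ins I made up for illustration, not the game’s actual rules:

```python
def brute_force(items, combine):
    """Try every unordered pair (including an item with itself) once,
    recording anything new that comes out."""
    discovered = set(items)
    # First item on the list, combined with every item at or below it.
    pairs = [(a, b) for i, a in enumerate(items) for b in items[i:]]
    for a, b in pairs:
        result = combine(a, b)
        if result is not None and result not in discovered:
            discovered.add(result)
            print(f"{a} + {b} = {result}")
    return discovered

# A toy recipe book standing in for the game's rules (frozensets make
# the pairs order-independent, just like dragging icons either way):
recipes = {frozenset(["fire", "water"]): "steam",
           frozenset(["water", "earth"]): "mud"}

found = brute_force(["water", "earth", "fire", "air"],
                    lambda a, b: recipes.get(frozenset([a, b])))
```

With the four classics as input, the sketch turns up steam and mud and nothing else, which matches the opening moves above.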

Inference and deduction

I reach a breaking point where items are being removed from play more and more often. This feels like the home stretch but brute forcing it will still take ages. Often, though, an item only has one or two combinations left to discover before it’s exhausted. I use this to my advantage.

Enter another oracle of sorts; this time, it’s the documentation of everything I’ve done so far. For any item still on the board, the game tells me how many undiscovered combinations it has, which items have been used to make it, and all the other items it has been used to make so far. This is all data I can analyse to look for yet more patterns, and to spot gaps in the patterns found so far. The rate at which I clear items off the board goes way up, and I’m still using the same manual interactions I’ve used all along.

End game

I’m not there yet. Do I still have more to learn? Is another strategy going to make itself obvious before I finish the game, or an old one resurface as the dynamics change?

And what am I going to explore next?

Agile Testing book club: Goals

This is the fourth part in my series highlighting some of the lessons I’m taking from reading Agile Testing by Lisa Crispin and Janet Gregory. Other entries in the series can be found here.

Chapter 5 had a lot of interesting things about metrics and strategy that I’ve taken notes on and have planned to incorporate into my work. I’m working on another post about metrics that will probably draw on some of the things discussed, but before I get into the nitty gritty of metrics I want to stay a bit higher level. The part that I want to focus on now was about having a goal or a purpose.

When you’re trying to figure out what to measure, first understand what problem you’re trying to solve.
— Agile Testing by Lisa Crispin and Janet Gregory, Chapter 5

This is very much connected to other ideas I’ve talked about before, like letting the team feel pain; pain provides a tangible goal and gives anything you do to address it a clear purpose.

This also put something from Chapter 4 into another context. Lisa and Janet talked about using retrospectives to figure out if a problem could benefit from hiring more testers. I’ve thought about this a few times in my current role: does the team need another tester?

The answer has to come from considering what led you to ask the question in the first place. If your team is limited just by the number of hours in a day and the number of people you have, then you probably want to ask about what skills would be best to add to the team, be it testing or otherwise. If the team is struggling on testing specifically, then another tester may just be a band-aid; could the team be better served by working on building everybody’s testing skills? What problem are you trying to solve? The answer is never “we have the wrong tester to developer ratio.”

It’s also a great grounding question. Last week I wrote about how important it is to identify specifically what you’re trying to test at any moment. The motivation is the same. If you find yourself struggling with too many variables in the air and a hundred contingencies and tangents about all kinds of other things, stop for a moment and ask: “What problem am I trying to solve?” or “What problem do I need to solve right now?”

Qualifying quantitative risk

Let’s start with quantifying qualitative risk first.

Ages ago I was under pressure from some upper management to justify timelines, and I found a lot of advice about using risk as a tool not only to help managers see what they’re getting from the time spent developing a feature (i.e., less risk) but also to help focus what testing you’re doing. This was also coming hand in hand with a push to loosen up our very well defined test process, which came out of very similar pressure. I introduced the concept of a risk assessment matrix as a way of quantifying risk, and it turned out to be a vital tool for the team in planning our sprints.

Five by five

I can’t find the original reference I based my version on, because if you simply google “risk assessment matrix” you’ll find dozens of links describing the same idea. The basic concept is this:

Rate the impact (or consequence) of something going wrong on a scale of 1 to 5, with 1 being effectively unnoticeable and 5 being catastrophic. Rate the likelihood (or probability) of something bad happening from 1 to 5, with 1 being very unlikely and 5 being almost certain. Multiply those together and you get a number that represents how risky it is on a scale from 1 to 25.

a 5x5 multiplication table, with low numbers labelled minimal risk and the highest numbers labelled critical risk
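
In code the whole computation is one multiplication; this sketch only adds the range checks that keep the inputs on the 1 to 5 scales. Nothing here is specific to any tool, it’s just the bare arithmetic:

```python
def risk_score(impact, likelihood):
    """Combine two 1-5 ratings into a 1-25 risk score."""
    for name, value in (("impact", impact), ("likelihood", likelihood)):
        if value not in range(1, 6):
            raise ValueError(f"{name} must be an integer from 1 to 5")
    return impact * likelihood
```

For example, a noticeable-but-internal problem (impact 3) that’s more likely than not to occur (likelihood 4) scores 12.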

How many ambiguities and room for hand waving can you spot already?

Risk is not objective

One of the biggest problems with a system like this is that there’s a lot of room for interpreting what these scales mean. The numbers 1 to 5 are completely arbitrary so we have to attach some meaning to them. Even the Wikipedia article on risk matrices eschews numbers entirely, using instead qualitative markers laid out in a 5×5 look-up table.

The hardest part of this for me and the team was dealing with the fact that neither impact nor probability are the same for everybody. For impact, I used three different scales to illustrate how different people might react:

To someone working in operations:

  1. Well that’s annoying
  2. This isn’t great but at least it’s manageable
  3. At least things are only broken internally
  4. People are definitely going to notice something is wrong
  5. Everything is on fire!

To our clients:

  1. It’s ok if it doesn’t work, we’re not using it
  2. It works for pretty much everything except…
  3. I guess it’ll do but let’s make it better
  4. This doesn’t really do what I wanted
  5. This isn’t at all what I asked for

And to us, the developers:

  1. Let’s call this a “nice-to-have” and fix it when there’s time
  2. We’ll put this on the roadmap
  3. We’ll bump whatever was next and put it in the next sprint
  4. We need to get someone on this right away
  5. We need to put everything on this right now

You could probably also frame these as performance impact, functional impact, and project impact. Later iterations adjusted the scales a bit and put in more concrete examples; anything that resulted in lost data for a client, for example, was likely to fall into the maximum impact bucket.

Interestingly, in a recent talk Angie Jones extended the basic idea of a 5×5 to include a bunch of other qualities as a way of deciding whether a test is worth automating. In her scheme, she uses “how quickly would this be fixed” as one dimension of the value of a test, whereas I’m folding that into the impact on the development team. I hadn’t seen other variations of the 5×5 matrix when coming up with these scales, and to me the most direct way of making a developer feel the impact of a bug was to ask whether they’d have to work overtime to fix it.

Probability was difficult in its own way as well. We eventually adopted a scale with each bucket mapping to a ballpark percentage chance of a bug being noticed, but even a qualitative scale from “rare” through to “certain” misses a lot of nuance. How do you compare something that will certainly be noticed by only one client to something that has a low chance of manifesting for every client? I can’t say we ever solidified a good solution to this, but we got used to whatever our de facto scale was.

How testing factors in

We discussed the ratings we wanted to give each ticket on impact and probability of problems at the beginning of each sprint. These discussions would surface all kinds of potential bugs, known troublesome areas, unanswered questions, and ideas of what kind of testing needed to be done.

Inevitably, when somebody explained their reasoning for assigning a higher impact than someone else by raising a potential defect, someone else would say “oh, but that’s easy to test for.” This was great—everybody’s thinking about testing!—but it also created a tendency to downplay the risk. Since a lower-risk item can make do with less thorough testing, we might never plan the very testing required to justify that low risk. Because of that, we added a caveat to our estimates: we estimated what the risk would be if we did no testing beyond, effectively, turning the thing on.

With that in mind, a risk of 1 could mean that one quick manual test would be enough to send it out the door. The rare time something was rated as high as 20 or 25, I would have a litany of reasons sourced from the team as to why we were nervous about it and what we needed to do to mitigate that. That number assigned to “risk” at the end of the day became a useful barometer for whether the amount of testing we planned to do was reasonable.

Beyond testing

Doing this kind of risk assessment had positive effects outside of calibrating our testing. The more integrated testing and development became, the clearer it was that management couldn’t just blame testing for long timelines on some of these features. I deliberately worked this into how I wanted the risk scale to be interpreted, so that it spoke to both design and testing:

Risk   Interpretation
1-4    Minimal: Can always improve later; just test the basics.
5-10   Moderate: Favour a solution that works over in-depth study, test realistic edge cases, and keep estimates lean.
12-16  Serious: Careful design, detailed testing on edges and corners, and detailed estimates on any extra testing beyond the norm.
20-25  Critical: In-depth studies, specialized testing, and conservative estimates.
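
As a lookup, those bands reduce to a few comparisons. One detail worth noting: scores like 11 or 17 through 19 can never occur as a product of two 1-to-5 ratings, so the apparent gaps between bands are harmless. This is a sketch of my bands, not any standard scheme:

```python
def risk_category(score):
    """Map a 1-25 risk score onto the bands in the table above.
    Products of two 1-5 ratings never land on 11 or 17-19, so the
    gaps between bands can't actually be hit."""
    if not 1 <= score <= 25:
        raise ValueError("score must be between 1 and 25")
    if score <= 4:
        return "Minimal"
    if score <= 10:
        return "Moderate"
    if score <= 16:
        return "Serious"
    return "Critical"
```

The boundaries being fuzzy in practice, the function is best read as a default you argue with, not a verdict.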

These boundaries are always fuzzy, of course, and this whole thing has to be evaluated in context. Going back to Angie Jones’s talk, she uses four of these 5×5 grids to get a score out of 100 for whether a test should be automated, and the full range from 25-75 only answers that question with “possibly”. I really like how she uses this kind of system as a comparison against her “gut check”, and my team used this in much the same way.

The end result

Although I did all kinds of fun stuff comparing these risk estimates against the story points we put on them, the total time spent on the ticket, and whether we were spending a reasonable ratio of time on test to development, none of it ever saw practical use beyond “hmmm, that’s kind of interesting” or “yeah that ticket went nuts”. Even though I adopted this tool as a way of responding to pressure from management to justify timelines, they (thankfully) rarely ended up asking for these metrics either. Once a ticket was done and out the door, we rarely cared about what our original risk estimate was.

On the other side, however, I credit these (sometimes long) conversations with how smoothly the rest of our sprints would go; everybody not only had a good understanding of what exactly needed to be done and why, but we arrived at that understanding as a group. We quantified risk to put a number into a tracking system, but the qualitative understanding of what that number meant is where the value lay.

The column that shall not be named

Until recently my team had a very simple setup for our sprint board, with just four steps: tickets would start as “To Do”, move to “In Progress” when somebody started working on them, then to “Code Review”, and when everything looked good the code was merged and the ticket marked “Done”.

Four columns, labelled To Do, In Progress, Code Review, and Done.

This worked well until we started to have issues with tickets after merging. A separate team was responsible for building a release and deploying it. As we put more tickets through, we started to find occasional issues at this stage, either just before deploying or soon after. These didn’t go unnoticed for long — we regularly check all our features in production — but they did show that we needed to pay more attention to the work between merging it and deploying it.

The natural inclination of the team was to do what most other teams in the department have already done: insert a “QA” column between Code Review and Done, where we can test the final build for anything unexpected.

The same four columns, with a QA column added between Code Review and Done

I wasn’t on board with this plan. I think it’s too easy to start thinking that QA only happens when a ticket is in that column. Even if everybody says they know that’s not true, the board can reinforce that idea anyway. Following Richard Bradshaw’s cue, I tried to explain that QA is something that happens, or should happen, throughout development. The problem is, I didn’t have a good alternative name for that new column:

QA spanning across all five columns, with an unnamed column between Code Review and Done

To me, “Testing” is just as limiting as “QA” (should I not do any testing when something is in code review?). “Verification”, the other candidate, seemed too narrow for what would go on in that column. I would have been happy with anything but “QA”. But I’m not sure I really made my point—or they took it too literally—because we ended up with this as our board:

The former unnamed column has been re-titled The Column That Shall Not Be Named

And from there it was a small step to this:

The Done column has been renamed to Yer a Wizard, Harry

This has had one delightful side-benefit: at our daily stand-ups, instead of asking if a ticket is done yet, we can now legitimately ask: “Are you a wizard yet?”