Dynamically create test cases with Robot Framework

In Robot Framework, there isn’t an obvious built-in way to create a list of tests to execute dynamically. I recently faced a case where I wanted to do this, and happily Bryan Oakley (blog, twitter, github) was able to help me through the problem. I’ve seen a few people with similar problems so thought it would be useful to document the solution.

Use the subheadings to skip down to the solution if you don’t want the backstory.

Why would I want to do this

Normally I’m against too much “magic” in test automation. I don’t like to see expected values calculated or constructed with a function that’s just as likely to have bugs as the app being tested, for example. I’ve seen tests with assertions wrapped in for loops that never check whether we actually did greater than zero assertions. Helper functions have an if/else to check two variations of similar behaviour and the test passes, but I can’t tell which of the two cases it thinks it found or whether that was the intended one. When you write a test case you should know what you’re expecting, so expect it. Magic should not be trusted.

But sometimes I need a little magic.

The problem I had was that I wanted to check that some background code was executing properly every time the user selected an option from a list, but the items in that list could be changed by another team at any time. It wasn’t sufficient to check that one of the items worked, or that a series of fake items, because I wanted to know that the actual configuration of each item in the real list was consistent with what our code expected. I’m basically testing the integration, but I would summarize it like this: “I want to test that our code properly handles every production use case.”

Importantly, though, I don’t just care that at least one item failed, I care how many items failed and which ones. That’s the difference between looping over every item within a test case and executing a new case for each one. Arguably this is just a reporting problem, and certainly I can drill down into the reports if I did this all with a loop in one test case, but I would rather have the most relevant info front and center.

The standard (unmaintainable) solution

Robot Framework does provide a way of using Test Templates and for-loops to accomplish something like this: given a list, it can run the same test on each item in the list. For 10 items, the report will tell you 10 passed, 10 failed, or somewhere in between. This works well if you know in advance which items you need to test:

*** Settings ***
Test Template    Some test keyword

*** Test Cases ***
:FOR    ${i}    IN RANGE     10
\    ${i}

This runs Some test keyword ten times, using the numbers 0 to 9 as arguments, which you’d define to click on the item index given and make whatever assertions you need to make. Of course as soon as the list changes to 9 or 11 items, this will either fail or silently skip items. To get around this, I added a teardown step to count the number of items in the list and issue a failure if it didn’t match the expected list. Still not great.

The reporting still leaves a bit to be desired, as well. It’s nicer to list out each case with a descriptor, like so:

*** Test Cases ***
Apples     0
Oranges    1
Bananas    2

We get a nice report that tells us that Apples passed but Oranges and Bananas failed. Now I can easily find which thing failed without counting items down the list, but you can see that this is even more of a maintenance nightmare. As soon as the order changes, my report is lying to me.

A failed intermediate option

When I brought this question up to the Robot Framework slack user group, Bryan suggested I look into using Robot’s visitor model and pre-run modifiers. Immediately this was over my head. Not being a comp-sci person, this was the first I had heard of the visitor pattern, but being some who always wants to learn this immediately sent me down a Wikipedia rabbit hole of new terminology. The basic idea here, as I understand it, is to write a modifier that would change a test suite when it starts. Bryan provided this example:

from robot.api import SuiteVisitor

class MyVisitor(SuiteVisitor):

    def __init__(self):
    def start_suite(self, suite):
        for i in range(3):
            tc = suite.tests.create(name='Dynamic Test #%s' % i)
            tc.keywords.create(name='Log', args=['Hello from test case #%s' % i])

# to satisfy robot requirement that the class and filename
# are identical
visitor = MyVisitor

This would be saved in a file called “visitor.py”, and then used when executing the suite:

robot --prerunmodifier visitor.py existing_suite.robot

I ran into problems getting this working, and I didn’t like that the pre-run modifier would apply to every suite I was running. This was just one thing I wanted to do among many other tests. I didn’t want to have to isolate this from everything else to be executed in its own job.

My next step to make this more flexible was to adapt this code into a custom python keyword. That way, I could call it from a specific suite setup instead of every suite setup. The basic idea looked like this:

tc = BuiltIn()._context.suite.tests.create(name="new test")

but I couldn’t get past a TypeError being thrown from the first line, even if I was willing to accept the unsupported use of _context. While I was trying to debug that, Bryan suggested a better way.

Solution: Adding test cases with a listener

For this, we’re still going to write a keyword that uses suite.tests.create() to add test cases, but make use of Robot’s listener interface to plug into the suite setup (and avoid _context). Again, this code comes courtesy of Bryan Oakley, though I’ve changed the name of the class:

from __future__ import print_function
from robot.running.model import TestSuite

class DynamicTestCases(object):

    def __init__(self):
        self.ROBOT_LIBRARY_LISTENER = self
        self.current_suite = None

    def _start_suite(self, suite, result):
        # save current suite so that we can modify it later
        self.current_suite = suite

    def add_test_case(self, name, kwname, *args):
        """Adds a test case to the current suite

        'name' is the test case name
        'kwname' is the keyword to call
        '*args' are the arguments to pass to the keyword

            add_test_case  Example Test Case  
            ...  log  hello, world  WARN
        tc = self.current_suite.tests.create(name=name)
        tc.keywords.create(name=kwname, args=args)

# To get our class to load, the module needs to have a class
# with the same name of a module. This makes that happen:
globals()[__name__] = DynamicTestCases

This is how Bryan explained it:

It uses a couple of rarely used robot features. One, it uses listener interface #3, which passes actual objects to the listener methods. Second, it uses this listener as a library, which lets you mix both a listener and keywords in the same file. Listener methods begin with an underscore (eg: `_start_suite`), keywords are normal methods (eg: `add_test_case`). The key is for `start_suite` to save a reference to the current suite. Then, `add_test_case` can use that reference to change the current test case.

Once this was imported into my test suite as a library, I was able to write a keyword that would define the test cases I needed on suite setup:

Setup one test for each item
    ${numItems}=    Get number of items listed
    :FOR    ${i}    IN RANGE    ${numItems}
    \     Add test case    Item ${i}
    \     ...              Some test keyword    ${i}

The first line of the keyword gets the number of items available (using a custom keyword for brevity), saving us the worry of what happens when the list grows or shrinks; we always test exactly what is listed. The FOR loop then adds one test case to the suite for each item. In the reports, we’ll see the tests listed as “Item 0”, “Item 1”, etc, and each one will execute the keyword Some test keyword with each integer as an argument.

I jazzed this up a bit further:

Setup one test for each item
    ${numItems}=    Get number of items listed
    ${items}=       Get webelements    ${itemXpath}
    :FOR    ${i}    IN RANGE    ${numItems}
    \   ${itemText}=    Set variable
    \   ...             ${items[${i}].get_attribute("text")}
    \   Add test case   Item ${i}: ${itemText}
    \   ...             Some test keyword ${i}

By getting the text of the WebElement for each item, I can set a more descriptive name. With this, my report will have test cases name “Item 0: Apple”, “Item 1: Orange”, etc. Now the execution report will tell me at a glance how many items failed the test, and which ones, without having to count indices or drill down further to identify the failing item.

The one caveat to this is that Robot will complain if you have a test suite with zero test cases in it, so you still need to define one test cases even if it does nothing.

*** Settings ***
Library        DynamicTestCases
Suite setup    Setup one test for each item

*** Test cases ***
Placeholder test
    Log    Placeholder test required by Robot Framework

*** Keywords ****
Setup one test for each item

You can not, unfortunately, use that dummy test to run the keyword to add the other test cases. By the time we start executing tests, it’s too late to add more to the suite.

Since implementing the DynamicTestCases library, my suite has no longer been plagued with failures caused only by another team doing their job. I’m now testing exactly what is listed at any given moment, no more and no less. My reports actually give me useful numbers on what is happening, and they identify specifically where problems were arising. I still have some safety checks in place on teardown to ensure that I don’t fail to test anything at all, but these have not flagged a problem in weeks.

As long as there’s a good use case for this kind of magic, I hope it is useful to others as well.

The Phoenix Project & the value of anecdotes

Book cover of The Phoenix ProjectI generally have a hard time reading non-fiction books. Though I always love learning, I rarely find them engaging enough that I look forward to continuing them the way I do with a novel. If the topic is related to work, I usually find myself wanting to take notes, but I usually only make time for reading in bed as I’m going to sleep. Even if I wasn’t going to take notes, I don’t like keeping my mind on work 24/7. Downing a cup of coffee just before bed has about the same effect as starting to contemplate “software testing this”, “Agile that”, and “product management the other”.

That’s why I was interested to give The Phoenix Project a try. It’s billed as a novel about DevOps, and was recommended by a coworker as providing some interesting insights. I decided to approach it like a novel, without thinking too much about it being about work-related, and put it on my nightstand.

The book starts with Bill, the IT manager of a struggling manufacturing company, and follows his attempts to make improvements to how IT is handled. I liked it immediately because the worst-case scenario that Bill starts the story in was eerily familiar to some of my own experience in less-than-healthy organizations.

The book reads quite well, though there are times where it feels more like a contrived example from a management textbook than a novel. This is especially true whenever the character Erik appears. His primary role is to show Bill the error of his ways, usually by pointing out that IT work—development and operations—are more like manufacturing he would previously have admitted. This really strains credulity when Erik gives Bill challenges like “call me when you figure out what the four kinds of work are.” As if there is a unique way of describing types of work with exactly and only four categories. But Bill does figure them out, correctly, and the plot moves on. Erik’s “Three Ways” were given a similar treatment. There are a lot of good lessons in this book, but I doubt anybody who’s been around the block a few times will believe how naturally Bill divines these concepts from first principles. Nor how easily he puts them into practice to turn the whole IT culture around, with just one token protest character that gets quickly sidelined.

Nonetheless, I do think the book achieved what it set out to do. The two major themes are worth outlining.

Four types of work

Early on Bill talks about how he needs to get “situational awareness” of where the problems are. One of the tools he arrives at, with Erik’s prompting, is visualizing the work his department is doing, which makes it obvious that there are types of work that had previously been invisible. The categories he identifies (after Erik told him that there were exactly four) are:

  1. Business Projects – the stuff that pays the bills.
  2. Internal IT Projects – Infrastructure changes, improvements, and technical debt would fall here.
  3. Changes – “generated from the previous two”, though honestly I don’t see why this exists independently of the first two.
  4. Unplanned work – arising from incidents or problems that need to be addressed at the expense of other work.

Certainly I like the lesson of visualizing all the work you’re doing and identifying where it comes from, but I don’t understand this particular breakdown. Is a “change” to an existing system really a whole separate category of work from a new project? What value does separating that provide? If the distinction between business and IT projects is significant, why isn’t there a difference between changes to business features and internal IT changes?

It wasn’t clear to me if these four types of work are something that can be found in the literature, where there might be some more justification for them, or if they are an invention of the authors. I’d be interested in learning more about the history here, if there is any. For what it’s worth, given the same homework of identifying four types of work from the hints Erik gave in the book, I might have broken it into either business or internal, and either planned or unplanned.

The Three Ways

These are presented as the guiding principles for DevOps. They’re introduced just as heavy-handedly as the four ways of work. At one point they go as far as saying that anybody managing IT without talking about the three ways is “managing IT on dangerously faulty assumptions”, which sounds like nonsense to me, especially given that they aren’t described as assumptions at all. Even still, the idea holds a bit more water for me. I can even see approaching these as series of prerequisites, each building on the next, or a maturity model (as long as you’re not the kind of person who is allergic to the words “maturity model”).

The three ways are:

  1. The flow of work from development, through operations, to the customer. I would add something like “business needs” to the front of that chain as well. The idea here is to make work move as smoothly and efficiently as possible from left to right, including limiting work in progress, catching defects early, and our old favourite CI/CD.
  2. The flow of feedback from customers back to the left, to be sure we prevent problems and make changes earlier in the pipeline to save work downstream, fueled by constant monitoring. The book includes fast automated test suites here, though to me that sounds more like part of the first way.
  3. This is where I got a bit lost; the third way is actually two things, neither of them “flows” or “ways” in the same sense as the first two. Part one is continual experimentation, part two is using repetition and practice as a way to build mastery. According to the authors, practice and repetition is what makes is possible to recover when experiments fail. To me, experiments shouldn’t require “recovery” when they fail at all, since they should be designed to be safe-to-fail in the first place. Maybe I would phrase this third way as constantly perturbing flows of the first two to find new maxima or minima.

What I like about these ways is that I can see applying them as a way to describe how a team is working and targeting incremental improvements. Not that a team needs to have 100% mastery of one way before moving on to the next (this is the straw man that people who rail against “maturity models” tend to focus on), but as a way of seeing some work as foundational to other work so as to target improvement where it will add the most integrity to the whole process. I’ve been thinking through similar ideas with respect to maturity of testing, and though I haven’t wrestled with this very much yet it feels promising.

Anecdotal value

I mentioned earlier that one weakness in the book is how easily everything changes for Bill as he convinces people around him to think about work differently. More to the point, it shows exactly one case study of organizational change. It’s essentially guaranteed that any person and organization with the same problems would react differently to the same attempts at change.

As I was thinking about what I wanted to say about this earlier this week, I saw this tweet from John Cutler:

It surprised me how thoroughly negative the feedback was in the replies. Many people seemed to immediately think that because solutions in one organization almost never translate exactly to another, hearing how others solved the same problem you’re having is of little value. Some asked how detailed the data would be. Others said they’d rather pay to hire the expert who studied those 50 cases as a consultant to provide customized recommendations for their org. Very few seemed to see it as a learning opportunity from which you could take elements of, say, 23 different solutions and tweak them to your context to form your own, while ignoring the elements and other solutions that aren’t as relevant to your own situation. Even if you hire the consultant as well, reading the case studies yourself will mean you will understand the consultant’s recommendations that much better, and can question them critically.

That’s why I think reading The Phoenix Project was worth it, even if it was idealized to the point where the main character seemingly has the Big Manager In The Sky speaking directly to him and everybody else treats him like a messiah (eventually). I can still take from it the examples that might apply to my work, internalize some of those lessons as I evaluate our own pipeline, and put the rest aside. Anecdotes can’t be scientific evidence that a solution to a problem is the right one, but these team dynamics aren’t an exact science anyway.

Seven things I learned at CAST

My CAST debrief from last week ended up being mostly about personal reflection, but I also wanted to call out the some of the things I picked up more specific to testing. After going through my notes I picked out what I thought where the important themes or interesting insights. There are many other tidbits, interactions, and ideas from those notes that I may write about later, these are just the top seven that I think will have an impact right away.

Each section here is paraphrased from the notes I took in each speaker’s session.

1. “Because we don’t like uncertainty we pretend we don’t have it”
Liz Keogh presented a keynote on using Cynefin as a way of embracing uncertainty and chaos in how we work. We are hard wired to see patterns where they don’t exist. “We treat complexity as if it’s predictable.” People experience the highest stress when shocks are unpredictable, so let’s at least prepare for them. To be safe to fail you have to be able to (1) know if it works, (2) amplify it if it does work, (3) know of it fails, (4) dampen it if it does fail, and (5) have a realistic reason for thinking there will be a positive impact in the first place. It’s not about avoiding failure, it’s about being prepared for it. The linked article above is definitely worth the read.

2. Testers are designers
From John Cutler’s keynote: “Design is the rendering of intent”. Anybody who makes decisions that affect that rendering is a designer. What you test, when you test, who you include, your mental models of the customer, and how we talk about what we find all factor in. UX researchers to have to know that they aren’t the user, but often the tester has used the product more than anybody else in the group. There’s no such thing as a design solution, but there is a design rationale. There should be a narrative around the choices we make for our products, and testers provide a ton of the information for that narrative. Without that narrative, no design was happening (nor any testing!) because there’s no sense of the intent.

3. Knowledge goes stale faster than tickets on a board do
From Liz Keogh’s lightning talk: If testers are at their WIP limit and devs can’t push any more work to them, your team will actually go faster if you make the devs go read a book rather than take on new work that will just pile up. You lose more productivity in catching up on work someone has already forgotten about than you gain in “getting ahead”. In practice, of course, you should have the devs help the testers when they’re full and the testers help the devs when they have spare capacity. (By the way, it’s not a Kanban board unless you have a signal for when you can take on new work, like an empty slot in a column. That empty slot is the kanban.)

4. “No matter how it looks at first, it’s always a people problem” – Jerry Weinberg
I don’t recall which session this quote first came up in, but it became a recurring theme in many of the sessions and discussions, especially around coaching testing, communicating, and whole-team testing. Jerry passed away just before the conference started, and though I had never met him he clearly had a strong influence on many of the people there. He’s probably most often cited for his definition that “quality is value to some person”.

5. Pair testing is one of the most powerful techniques we have
Though I don’t think anybody said this explicitly, this was evident to me from how often the concept came up. Lisi Hocke gave a talk about pairing with testers from other companies to improve her own testing while cross-pollinating testing ideas and skills with the wider community. Amit Wertheimer cited pairing with devs as a great way to identify tools and opportunities to make their lives easier. Jose Lima talked about running group exploratory testing sessions and the benefits that brings in learning about the product and coaching testing. Of course coaching itself, I think, is a form of pairing so the tutorial with Anne-Marie Charrett contributed to this theme as well.  This is something that I need to do more of.

6. BDD is not about testing, it’s about collaboration
From Liz Keogh again: “BDD is an analysis technique, it’s not about testing.” It’s very hard to refactor English and non-technical people rarely read the cucumber scenarios anyway. She says to just use a DSL instead. If you’re just implementing cucumber tests but aren’t having a 3-amigos style conversation with the business about what the scenarios should be, it isn’t BDD. Angie Jones emphasized these same points when introducing cucumber in her automation tutorial as a caveat that she was only covering the automation part of BDD in the tutorial, not BDD itself. Though I’ve worked in styles that called themselves “behaviour driven”, I’ve never worked with actual “BDD”, and this was the first time I’ve heard of it being more than a way of automating test cases.

7. Want to come up with good test ideas? Don’t read the f*ing manual!
From Paul Holland: Detailed specifications kill creativity. Start with high level bullet points and brainstorm from there. Share ideas round-robin (think of playing “yes and“) to build a shared list. Even dumb ideas can trigger good ideas. Encourage even non-sensical ideas since users will do things you didn’t think of. Give yourself time and space away to allow yourself to be creative, and only after you’ve come up with every test idea you can should you start looking at the details. John Cleese is a big inspiration here.

Bonus fact: Anybody can stop a rocket launchThe bright light of a rocket engine taking off in the night
The “T-minus” countdown for rocker launches isn’t continuous, they pause it at various intervals and reset it back to checkpoints when they have to address something. What does this have to do with testing? Just that the launch of the Parker Solar Probe was the weekend after CAST. At 3:30am on Saturday I sat on the beach with my husband listening to the live feed from NASA as the engineers performed the pre-launch checks: starting, stopping, and resetting the clock as needed. I was struck by the fact that at “T minus 1 minute 55 seconds” one calm comment from one person about one threshold being crossed scrubbed the entire launch without any debate. There wouldn’t be time to rewind to the previous checkpoint at T-minus 4 minutes before the launch window closed, so they shut the whole thing down. I’m sure that there’s an analogy to the whole team owning gates in their CD pipelines in there somewhere!

CAST 2018 Debrief

Last week I was lucky enough to attend the Conference of the Association of Software Testing, CAST 2018. I had been to academic conferences with collaborators before, and a local STAR conference here in Toronto, but this was my first time travelling for a professional conference in testing. The actual experience ended up being quite trying, and I ended up learning as much about myself as about testing. I don’t feel the need to detail my whole experience here, but I will highlight the top 5 lessons I took away from it.

1. “Coaching” is not what I hoped it was

I’ve been hearing a lot about “coaching” as a role for testers lately. I went to both Anne-Marie Charrett‘s tutorial and Jose Lima‘s talk on the subject thinking that it was a path I wanted to pursue. I went in thinking about using as a tool to change minds, instill a some of my passion for testing into the people I work with, and building up a culture of quality. I came away with a sense of coaching as more of a discussion method, a passive enterprise available for those who want to engage in it and useless for the uninterested. I suspect those who work as coaches would disagree, but that was nonetheless my impression.

One theme that came up from a few people, not just the speakers, was a distinction between coaching and teaching. This isn’t something I really understand, and is likely part of why I was expecting something else from the subject. I taught university tutorials for several years and put a lot of effort into designing engaging classes. To me, what I saw described as coaching felt like a subset of teaching, a particular style of pedagogy, not something that stands in contrast to it. Do people still hear “teaching” and think “lecturing”? I heard “coaching testing” and expected a broader mandate of education and public outreach that I associate with “teaching”.

Specifically, I was looking for insight on breaking through to people who don’t like testing, and who don’t want to learn about it, but very quickly saw that “coaching” wasn’t going to help me with that. At least not on the level at which we got into it in within one workshop. I am sure that this is something that would be interesting to hash out in a (meta) coaching session with people like Anne-Marie and Jose, even James Bach and Michael Bolton: i.e. people who have much more knowledge about how coaching can be used than I do.

2. I’m more “advanced” than I thought

My second day at the conference was spent in a class billed as “Advanced Automation” with Angie Jones (@techgirl1908). I chose this tutorial over other equally enticing options because it looked like the best opportunity for something technically oriented, and would produce a tangible artefact — an advanced automated test suite — that I could show off at home and assimilate aspects of into my own automation work.

Angie did a great job of walking us through implementing the framework and justifying the thought process each step of the way. It was a great exercise for me to go through implementing a java test suite from scratch, including a proper Page Object Model architecture and a TDD approach. It was my first time using Cucumber in java, and I quite enjoyed the commentary on hiring API testers as we implemented a test with Rest-Assured.

Though I did leave with that tangible working automation artefact at the end of the day, I did find that a reverse-Pareto principle at play with 80% of the value coming from the last 20% of the time. This is what lead to my take away that I might be more advanced than I had thought. I still don’t consider myself an expert programmer, but I think I could have gotten a lot further had we started with a basic test case already implemented. Interestingly Angie’s own description for another workshop of hers say “It’s almost impossible to find examples online that go beyond illustrating how to automate a basic login page,” though that’s the example we spent roughly half the day on. Perhaps we’ve conflated “advanced” with “well designed”.

3. The grass is sometimes greener

In any conference, talks will vary both in quality generally and how much they resonate with any speaker specifically. I was thrilled by John Cutler‘s keynote address on Thursday — he struck many chords about the connection between UX and testing that align very closely with my own work — but meanwhile Amit Wertheimer just wrote that he “didn’t connect at all” to it. I wasn’t challenged by Angie’s advanced automation class but certainly others in the room were. This is how it goes.

In a multi-track conference, there’s an added layer that there’s other rooms you could be in that you might get more value from. At one point, I found myself getting dragged down in a feeling that I was missing out on better sessions on the other side of the wall. Even though there were plenty of sessions where I know I was in the best room for myself, the chatter on Twitter and the conference slack workspace sometimes painted a picture of very green grass elsewhere. Going back to Amit’s post, he called Marianne Duijst‘s talk about Narratology and Harry Potter one of the highlights of the whole conference, and I’ve seen a few others echo the same sentiment on Twitter. I had it highlighted on my schedule from day one but at the last minute was enticed by the lightning talks session. I got pages of notes from those talks, but I can’t help but wonder what I missed. Social Media FOMO is real and it takes a lot of mental energy to break out of that negative mental cycle.

Luckily, the flip side of that kind of FOMO is that asking about a session someone else was in, or gave themselves, is a great conversation starter during the coffee breaks.

4. Networking is the worst

For other conferences I’ve been to, I had the benefit either of going with a group of collaborators I already knew or being a local so I could go home at 5 at not worry about dinner plans. Not true when flying alone across the continent. I’ve always been an introvert at the best of times, and I had a hard time breaking out of that to go “network”.

I was relieved when I came across Lisa Crispin writing about how she similarly struggled when she first went to conferences, although that might have helped me more last week than today. Though I’m sure it was in my imagination just as much as it was in hers at her first conference, I definitely felt the presence of “cliques” that made it hard to break in. Ironically, those that go to conferences regularly are less likely to see this happening, since those are the people that already know each other. Speakers and organizers even less so.

It did get much easier once we moved to multiple shorter sessions in the day (lots of coffee breaks) and an organized reception on Wednesday. I might have liked an organized meet-and-greet on the first day, or even the night before the first tutorial, where an introvert like me can lean a bit more on the social safety net of mandated mingling. Sounds fun when I put it like that, right?

I eventually got comfortable enough to start talking with people and go out on a limb here or there. I introduced myself to the all people I aimed to and asked all the questions I wanted to ask… eventually. But there were also a lot of opportunities that I could have taken better advantage of. At my next conference, this is something I can do better for myself, though it also gives me a bit more sensitivity about what inclusion means.

5. I’m ready to start preparing my own talk

Despite my introverted tendencies I’ve always enjoyed teaching, presenting demos, and giving talks. I’ve had some ideas percolating in the back of my mind about what I can bring to the testing community and my experiences this week — in fact every one of the four points above — have confirmed for me that speaking at a conference is a good goal for myself, and that I do have some value to add to the conversation. I have some work to do.

Bonus lessons: Pronouncing “Cynefin” and that funny little squiggle

Among the speakers, as far as notes-written-per-sentence-spoken, Liz Keogh was a pretty clear winner by virtue of a stellar lightning talk. Her keynote and the conversation we had afterward, however, is where I picked up these bonus lessons. I had heard of Cynefin before but always had two questions that never seemed to be answered in the descriptions I had read, until this week:

A figure showing the four domains of Cynefin

  1. It’s pronounced like “Kevin” but with an extra “N”
  2. The little hook or squiggle at the bottom of the Cynefin figure you see everywhere is actually meaningful: like a fold in some fabric, it indicates a change in height from the obvious/simple domain in the lower right from which you can fall into the chaotic in the lower left.

In defence of time over story points

I have to admit, there was a time when I was totally on board with estimating work in “story points”. Briefly I was the resident point-apologist around town, explaining metaphors about how points are like the distance of a race that people complete in different times. These days, while estimating complexity has its uses, I’m coming to appreciate those old fashioned time estimates.

Story points are overrated. Here’s a few of the reasons why I think so. Strap yourselves in, this is a bit of a rant. But don’t worry, I’ll hedge at the end.

The scale is arbitrary and unintuitive

How do you measure complexity? What units do you use? Can you count the number of requirements, the acceptance criteria, the number of changes, the smelliness of the code to be changed, the number of test cases required, or the temperature of the room after the developers have debated the best implementation?

To avoid that question, story points use an arbitrary scale with arbitrary increments. It could be the Fibonacci sequence, powers of two, or just numbers 1 through 5. That itself is not necessarily a problem — Fahrenheit and Celsius are both arbitrary scales that measure something objective — but if you ask 10 developers what a “1” means you’ll get zero answers if they haven’t used points yet and 20 answers 6 months later.

I don’t know anybody who has an intuition for estimating “complexity” because there’s no scale for it. There’s nothing to check it against. Meanwhile we’ve all been developing an intuition for time every since we started asking “are we there yet?” from the back of the car or complaining that it wasn’t late enough for bedtime.

People claim that you can build your scale by taking the simplest task as a “1” and going from there. But complexity doesn’t scale like that. What’s twice as complicated as, say, changing a configuration value? Even if you compare tickets being estimated with previous ones, you’re never going to place it in an ordered list (even if binned) of all previous tickets. You’re guaranteed to have some that are more “complex” than others rated at lower points because you were feeling confident that day or didn’t have a full picture of the work. (Though if you do try this, it can give you the side benefit of questioning whether those old tickets really deserve the points they got.)

It may not be impossible to get a group of people to come to a common intuition around estimating complexity, but it sure takes a lot longer than agreeing on how long a day or a week is. Even if you did reach that common understanding, nobody outside the team will understand it.

Points aren’t what people actually care about

People, be it either the business or dependent teams, need to schedule things. If we want to have goals and try to reach them, we have to have some idea of how much we have to do to get there and how much time it will take to do that work. If someone asks “when can we start work on feature B” and you say “well feature A has 16 points”, their next question is “OK, and how long will that take?” or “and when will it be done?” Points don’t answer either question, and nobody is going to be happy if you tell them the question can’t be answered.

In practice (at least in my experience) people use time anyway. “It’ll only take an hour so I’m giving it one point”. “I’d want to spend a week on this so let’s give it 8 points.” When someone says “This is more complicated so we better give it more points” it’s because they’ll need more time to do it!

Maybe I care about complexity because complexity breeds risk and I’ll need to be more careful testing it. That’s fair, and a decent reason for asking the question, but it also just means you need more time to test it. Complexity is certainly one dimension of that but it isn’t the whole story (impact and probability of risks manifested are others).

Even the whole premise of points, to be able to measure the velocity of a team, admits that time is the important factor. Velocity matters because it tells you how much you work you can reasonably put into your sprint. But given a sprint length you already know how many hours you can fit into a sprint. What’s the point of going around the bush about it?

Points don’t provide feedback

Time provides has a built in feedback that points can’t. That’ll take me less than a day, I say. Two days later we have a problem.

Meanwhile I say something is 16 story points. Two days later it isn’t done… do I care? Am I running behind? What about 4 weeks later? Was this really a 16 point story or not? Oh, actually, someone was expecting it last Thursday? That pesky fourth dimension again!

Points don’t avoid uncertainty

I once heard someone claim that one benefit of story points is that they don’t change when something unexpected comes up. In one sense that’s true, but only if there’s no feedback on the actual value of points. Counterexamples are about as easy to find as stories themselves.

Two systems interact with each other in a way the team didn’t realize. Someone depends on the legacy behaviour so you need to add a migration plan. The library that was going to make the implementation a single line has a bug in it. Someone forgot to mention a crucial requirement. There are new stakeholders that need to be looped in. Internet Explorer is a special snowflake. The list goes on, and each new thing can make something more complex. If they don’t add complexity after you’ve assigned a number, what creates the complexity in the first place?

Sure you try to figure out all aspects of the work as early as possible, maybe even before it gets to the point of estimating for a sprint. Bring in the three amigos! But all the work you do to nail down the “complexity” of a ticket isn’t anything special about “complexity” as a concept, it’s exactly the same kind of work you’d do to refine a time estimate. Neither one has a monopoly on certainty.

Points don’t represent work

One work ticket might require entering configurations for 100 clients for a feature we developed last sprint. It’s dead simple brainless work and there’s minimal risk beyond copy-paste errors that there are protections for anyway. Complexity? Nah, it’s one point, but I’ll spend the whole sprint doing it.

Another work ticket is replacing a legacy piece of code to support an upcoming batch of features. We know the old code tends to be buggy and we’ve been scared to touch it for years because of that. The new version is already designed but it’ll be tricky to plug in and test thoroughly to make sure we don’t break anything in the process. Not a big job—it can still be done in one sprint—but relatively high risk and complex. 20 points.

So wait, if both of those fit in one sprint, why do I care what the complexity is? There are real answers to that, but answering the question of how much work it is isn’t one of them. If you argue that those two examples should have similar complexity since they both take an entire sprint, then you’re already using time as the real estimator and I don’t need to convince you.

Points are easily manipulated

Like any metric, we must question how points can be manipulated and whether there’s incentive to do so.

In order to see increase in velocity, you have to have a really well understood scale. The only way to calibrate that scale without using a measurable unit is to spend months “getting a feel for it”.

Now if you’re looking for ways to increase your velocity, guaranteed the cheapest way to do that (deliberately or not) is to just start assigning more points to things. Now that the team has been at this for a while, one might say, they can better estimate complexity. Fewer unknowns mean more knowns, which are more things to muddy the discussion and push up those complexity estimates. (Maybe you are estimating more accurately, but how can you actually know that?) Voila. Faster velocity brought to you in whole by the arbitrary, immeasurable, and subjective nature of points.

Let’s say we avoid that trap, and we actually are getting better at the work we’re doing. Something that was really difficult six months ago can be handled pretty quickly now without really thinking about it. Is that ticket still as complex as it was six months ago? If the work hasn’t changed it should be, but it sure won’t feel as complex. So is your instinct going to be to put the same points on it? Velocity stagnates even though you’re getting more done. Not only can velocity be manipulated through malice, it doesn’t even correlate with the thing you want to measure!

It’s a feature, not a bug

One argument I still anticipate in favour of points is that the incomprehensibility of them is actually a feature, not a bug. It’s arbitrary on purpose so that it’s harder for people outside the team to translate them into deadlines to be imposed onto that team. It’s a protection mechanism. A secret code among developers to protect their own sanity.

If that’s the excuse, then you’ve got a product management problem, not an estimation problem.

In fact it’s a difficulty with metrics, communication, and overzealous people generally, not something special about time. The further metrics get from the thing they measure, the more likely they are to be misused. Points, if anybody understood them, would be just as susceptible to that.

A final defence of complexity

As far as a replacement for estimating work in time, story points are an almost entirely useless concept that introduces more complexity than it estimates. There’s a lot of jumping through hoops and hand waving to make it look like you’re not estimating things in time anymore. I’d much rather deal in a quantity we actually have units for. I’m tempted to say save yourself the effort, except for one thing: trying to describe the complexity of proposed work is a useful tool for fleshing out what the work actually requires and to get everybody on an equal footing understanding that work. That part doesn’t go away, though the number you assign to it might as well. Just don’t pretend it’s more meaningful than hours on a clock.