The Greg Score: 12 Steps to Better Testing

Ok, I’ll admit right off the bat that this post is not going to give you 12 steps to better testing on a silver platter, but bear with me.

A while back, I was trying to figure out a way for agile teams without a dedicated tester or QA expert to recognize bottlenecks and inefficiencies in their testing processes. I wanted a way of helping teams see where they could improve their testing by sharing the expertise that already existed elsewhere in the company. I had been reading about maturity models, and though they can be problematic (more on that later), they led me to try to come up with a simple list of good practices teams could aim to adopt.

When I started floating the idea with colleagues and circulating a few early drafts, a friend of mine pointed out that what I was moving towards was a lot like a testing version of the Joel Test:

The Joel Test: 12 Steps to Better Code

Now, to be clear, that Joel Test is 18 years old, and it shows. It’s outdated in a lot of ways, and even a little insulting (“you’re wasting money by having $100/hour programmers do work that can be done by $30/hour testers”). It might be more useful as a representation of where software development was in 2000 than anything else, but some parts of it still hold up. The concept was there, at least. The question for me was: could I come up with a similarly simple list of practices for testing that teams could use to get some perspective on how they were doing?

A testers’ version of the Joel Test

In my first draft I wrote out ideas for good practices, why teams would adopt them, and examples of how they would apply to some of the specific products we worked on. I came up with 20-30 ideas after that first pass. A second pass cut that nearly in half after grouping similar things together, rephrasing some to better expose the core ideas, and getting feedback from testers on a couple of other teams. I don’t have a copy of the list that we came up with any more, but if I were to come up with one today off the top of my head it might include:

  1. Do tests run automatically as part of every build?
  2. Do developers get instant feedback when a commit causes tests to fail?
  3. Can someone set up a clean test environment instantly?
  4. Does each team have access to a test environment that is independent of other teams?
  5. Do you keep a record of test results for every production release?
  6. Do you discuss as a team how features should be tested before writing any code?
  7. Is test code version controlled in sync with the code it tests?
  8. Does everybody contribute to test code?
  9. Are tests run at multiple levels of development?
  10. Do tests reliably pass when nothing is wrong?

I’m deliberately writing these to be somewhat general now, even though the original list could include a lot of technical details about our products and existing process. After I left the company, someone I had worked with on the idea joked with me that they had started calling the list the “Greg Score”. Unfortunately the whole enterprise was more of a spider than a starfish and as far as I know it never went anywhere after that.

I’m not going to go into detail about what I mean by each of these or why I thought to include it today, because I’m not actually here trying to propose this as a model (so you can hold off on writing that scathing takedown of why this is a terrible list). I want to talk about the idea itself.

The problem with maturity models

When someone recently used the word “mature” in the online community in reference to testing, it sparked immediate debate about what “maturity” really means and whether it’s a useful concept at all. Unsurprisingly, Michael Bolton has written about this before, coming down hard against maturity models, in particular the TMMi. Despite those arguments, the only problem I really see is that the TMMi is someone else’s model for what maturity means. It’s a bunch of ideas about how to do good testing prioritized in a way that made sense to the people writing it at the time. Michael Bolton just happens to have a different idea of what a mature process would look like:

A genuinely mature process shouldn’t emphasize repeatability as a virtue unto itself, but rather as something set up to foster appropriate variation, sustainability, and growth. A mature process should encourage risk-taking and mistakes while taking steps to limit the severity and consequence of the mistakes, because without making mistakes, learning isn’t possible.

— Michael Bolton, Maturity Models Have It Backwards

That sounds like the outline for a maturity model to me.

In coming up with my list, there were a couple things to emphasize.

One: This wasn’t about comparing teams to say one is better than another. There is definitely a risk it could be turned into a comparison metric if poorly managed, but even if you wanted to, it should prove impossible pretty quickly because:

Two: I deliberately tried to describe why a team would adopt each idea, not why they should. That is, I wanted to make it explicit that if the reasons a team would consider adopting a process didn’t exist, then they shouldn’t adopt it. If I gave this list to 10 teams, they’d all find at least one thing on it that they’d decide wasn’t important to their process. Given that, who cares if one team has 2/10 and another has 8/10, as long as they’re both producing the appropriate level of quality and value for their contexts? Maybe the six ideas in between don’t matter in the same way to each team, or wouldn’t have the same impact even if you did implement them.

Three: I didn’t make any claims that adopting these 10 or 12 ideas would equate to a “fully mature” or “complete” process; they were just the top 10 or 12 ideas that this workgroup of testers decided could offer the best ROI for teams in need. It was a way of offering some expertise, not of imposing a perfect system.
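To make those points a little more concrete, here is a minimal sketch of how such a list could be recorded so that it yields a per-team set of things to consider rather than a score to compare. Everything in it is hypothetical: the `Practice` structure, the field names, and the placeholder "why" text are assumptions for illustration, not the format of the original list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Practice:
    question: str                   # the yes/no question from the list
    why: str                        # the reason a team *would* adopt it (placeholder text here)
    applies: bool                   # does that reason actually exist for this team?
    adopted: Optional[bool] = None  # None when the question is not relevant to the team


def worth_discussing(practices: List[Practice]) -> List[str]:
    """Return the questions a team might want to talk about.

    Practices whose motivating reason does not apply are skipped entirely,
    so the result is deliberately not a score out of 10.
    """
    return [p.question for p in practices if p.applies and not p.adopted]


# Hypothetical self-assessment for one team.
team_a = [
    Practice("Do tests run automatically as part of every build?",
             "placeholder: feedback on every change", applies=True, adopted=True),
    Practice("Can someone set up a clean test environment instantly?",
             "placeholder: environment setup is a bottleneck", applies=True, adopted=False),
    Practice("Does each team have access to an independent test environment?",
             "placeholder: teams block each other", applies=False),
]

print(worth_discussing(team_a))
```

Two teams with very different answers can both end up with a short, sensible discussion list, which is the only output this sketch cares about.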

Different models for different needs

This list doesn’t have everything on it that I would have put on it two years ago, and it likely has things on it that I’ll disagree with two years from now. (Actually I wrote that list a couple days ago and I am already raising my eyebrow at a couple of them.) I have no reason to expect that this list would make a good model for anybody else. I don’t even have any reason to expect that it would make a good model for my own team since I didn’t get their input on it. Even if it were, I wouldn’t score perfectly on it. If you could score perfectly, that would mean the list is out of date or no longer useful.

What I do suggest is to try to come up with a list like this for yourself, in your own context. It might overlap with mine and it might not. What are the key aspects of your testing that you couldn’t do without, and what do you wish you could add? It would be very interesting to have as many testers as possible write their own 10-point rubric for their ideal test process to see how much overlap there actually is, and then do it again in a year or two to see how it’s changed.


Agile Testing book club: Everyone is a Tester

If you’ve even dipped your toe into the online testing community, there’s a good chance that you’ve heard of Agile Testing by Janet Gregory and Lisa Crispin as a recommended read. A couple weeks ago I got my hands on a copy and thought it would be a useful exercise to record the highlights of what I learn along the way. There is a lot in here, and I can tell that what resonates will be different depending on where I am mentally at the time. What I highlight will no doubt be just one facet of each chapter, and a different one from what someone else might read.

So, my main highlight from Chapters 1 and 2:

everyone is a tester

Janet and Lisa immediately made an interesting distinction that I would never have thought of before: they don’t use “developer” to refer to the people writing the application code, because everybody on an agile team is contributing to the development. I really like emphasizing this. I’m currently in an environment where we have “developers” writing code and “QA” people doing tests, and even though we’re all working together in an agile way, I can see how those labels can create a divide where there should not be one.

Similarly surprising and refreshing was this:

Some folks who are new to agile perceive it as all about speed. The fact is, it’s all about quality—and if it’s not, we question whether it’s really an “agile” team. (page 16)

The first time I encountered Agile, it was positioned by managers as being all about speed. Project managers (as they were still called) positioned it as all about delivering something of value sooner than possible otherwise, which is still just emphasizing speed in a different way. If asked, I probably would have said it was about being agile (i.e., able to adapt to change) because that was the aspect of it that made it worth adopting compared to the environment we worked in before. Saying it’s all about quality? That was new to me, but it made sense immediately, and I love it. Delivering smaller bits sooner is what lets you adapt and change based on feedback, sure, but you do that so you end up with something that everyone is happier with. All of that is about quality.

So now, if everybody on the team should be counted as a developer, and everything about agile is about delivering quality, it makes perfect sense that the main drive for everybody on the team should be delivering that quality. The next step is obvious: “Everyone on an agile team is a tester.” Everyone on the team is a developer and everyone on the team is a tester. That includes the customer, the business analysts, the product owners, everybody. Testing has to be everybody’s responsibility for agile-as-quality to work. Otherwise how do you judge the quality of what you’re making? (Yes, the customer might be the final judge of what quality means to them, but they can’t be the only tester any more than a tester can be.)

Now, the trick is to take that understanding and help a team to internalize it.

What the Bug!? (An attempt at knowledge sharing, two ways)

When I was making the transition from waterfall style projects to agile teams in a previous company, one of the main things I struggled with was the loss of the testing team as we all became generic “software engineers”. We all still did the same testing tasks, but none of us were “testers” any more. There were a lot of positive effects from the change, but I kept feeling like without dedicated testers focused on improving our testing craft, we’d stagnate.

What I was missing, as I eventually realized, was a testing community. A recent episode of the AB Testing podcast talked all about building a community of testers, which brought all of this back to mind. A few of the ideas Alan and Brent talked about I had actually tried at the time. One in particular was the idea of highlighting major fails of the week in a newsletter, even offering prizes for the “winner”. Back in 2016 I heard a similar idea at a STARCanada talk, where the engineering group at AppFolio would send an email to everybody in the company for every bug they found describing what was learned, again emphasizing that finding these bugs was a positive thing, not a blame game. (Sorry I can’t find now who gave that talk; if it was you let me know!)

The reason the idea stuck with me at the time was primarily because I had started to notice that as our newly agile teams specialized in subsets of what was previously a monolithic product, we started to lose visibility on bugs that other teams ran into. Different teams were getting bitten by the same issues. It didn’t help that the code base was also old enough that newer people would run into old bugs, spend potentially hours debugging them, and then hear “oh yeah, that’s a known problem.” (My response to that was to scream “It isn’t known if people don’t know it!” silently to myself.)

Here’s what I tried to do about it:

What The Bug!?
Honestly, the part of the idea I’m most proud of might be the name

I wanted teams to start talking more about the bugs they found, both so that others could learn from them and so that we could all tune our spidey senses to the sorts of issues that were actually important. There wasn’t much appetite for an email newsletter—people didn’t seem to read the newsletters the company already had—but we ended up trying two alternatives, one of which was pretty successful and one of which really wasn’t.

Building a knowledge base

The first idea was to solicit short and easily digestible details about every production bug that got logged into our bug tracker. Anybody who logged a bug would get an email asking them to answer three questions. The key was that the answers had to be short—think one tweet—and written at a level that any developer in the company would be able to understand. Bonus points for memorable descriptions. The questions were roughly:

  1. What was the bug?
  2. What was important about it?
  3. What one lesson can we learn from it?

The answers were linked back to the original ticket and tagged with a bunch of metadata like which team found it, the part of the system it was found in, what language the code was in, and any other keywords that would make it easy to find again. The idea was that if I was going to start writing something in a new language or working on a new part of the system, I could go look up the related tags and immediately find a list of easily digestible problems that I should stay alert for. I think it was an okay idea, but there were issues.
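For the curious, the shape of that knowledge base can be sketched in a few lines. This is only an illustration under assumptions: the class and field names are made up here, the ticket ID and tags are invented, and the real thing lived alongside our bug tracker rather than in a Python script.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class BugSummary:
    ticket: str            # link back to the original ticket (hypothetical ID format)
    what: str              # 1. What was the bug?
    why_important: str     # 2. What was important about it?
    lesson: str            # 3. What one lesson can we learn from it?
    tags: Set[str] = field(default_factory=set)  # team, component, language, keywords


class KnowledgeBase:
    """A tag index over short bug summaries."""

    def __init__(self) -> None:
        self._by_tag: Dict[str, List[BugSummary]] = defaultdict(list)

    def add(self, summary: BugSummary) -> None:
        # Index the summary under every tag, case-insensitively.
        for tag in summary.tags:
            self._by_tag[tag.lower()].append(summary)

    def lessons_for(self, *tags: str) -> List[str]:
        """Lessons from every summary matching any of the given tags, without repeats."""
        seen: List[BugSummary] = []
        for tag in tags:
            for summary in self._by_tag.get(tag.lower(), []):
                if summary not in seen:
                    seen.append(summary)
        return [s.lesson for s in seen]


kb = KnowledgeBase()
kb.add(BugSummary(
    ticket="BUG-0000",  # placeholder, not a real ticket
    what="Names with emoji were truncated on save",
    why_important="Customer data was silently changed",
    lesson="Test input with characters outside the basic plane",
    tags={"python", "import-service", "unicode"},
))

# Before starting work on a new area, look up what has bitten others there.
print(kb.lessons_for("Python", "unicode"))
```

The lookup by tag is the whole point: someone new to a language or component asks one question and gets back a handful of one-line lessons.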

First problem: People were pretty bad at writing descriptions of bugs that were short, but also useful and interesting. It not only took a lot of creativity, but in order to do it well you also really had to examine what was important about the bug in the first place. The example I used as a bad answer to the second question was “This caused an error every Tuesday”. What caused what kind of error every WHY TUESDAY!? This was especially problematic for the third question, where often the answers that came back were “testing is important” or “we’ll test this next time”. True, but shallow. What I was really hoping for would have been “There are different kinds of Unicode characters, so always consider testing different character planes, byte lengths, and reserved code points”. To really make the knowledge base as useful as it could have been, it would have needed committed editors who would talk to the people involved and craft a really good summary with them.
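As an aside, the Unicode lesson above is the kind that becomes much more memorable with a concrete sample attached. Here is a rough sketch of what "different planes and byte lengths" can mean in practice. The sample set and the `describe` helper are my own illustration, not something from the original knowledge base, and the list is nowhere near exhaustive.

```python
# A handful of single characters chosen to differ in plane and UTF-8 byte length.
SAMPLES = {
    "ascii letter": "A",                   # U+0041
    "accented latin letter": "\u00e9",     # U+00E9
    "currency symbol": "\u20ac",           # U+20AC, still in the Basic Multilingual Plane
    "emoji": "\U0001F600",                 # U+1F600, outside the Basic Multilingual Plane
    "noncharacter": "\uffff",              # U+FFFF, a code point reserved as a noncharacter
}


def describe(ch: str) -> dict:
    """Report the code point, plane, and UTF-8 encoded length of one character."""
    code_point = ord(ch)
    return {
        "code_point": f"U+{code_point:04X}",
        "plane": code_point >> 16,               # each plane covers 0x10000 code points
        "utf8_bytes": len(ch.encode("utf-8")),
    }


for label, ch in SAMPLES.items():
    print(label, describe(ch))
```

Feeding each of these through an input field, an export, or a database round trip is a cheap way to find out which of them a system quietly mangles.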

Second problem: The response rate was pretty lousy. It might have been that targeting every production bug was just too much. Not everybody is going to see something interesting in every bug, and not everybody is as interested in learning from them. That was part of the culture that I wanted to change (I wanted everybody else to be as excited about these bugs as I was!), but it wasn’t going to happen overnight.

Third problem: It might seem minor, but the timing of the email prompt coming the day after the bug was logged was often just too soon to have digested what the real lesson learned was. This turned up as a problem as I asked around about why people weren’t taking part. They just didn’t have the answers yet.

All of this created a chicken-and-egg problem. Until people saw the value of this project, saw interesting summaries, and got excited to contribute their own experiences, we wouldn’t get the content we needed to build that interest or excitement. And, in all honesty, though there was a conscious effort with this to make an accessible library of bugs compared to the technical JIRA tickets, we were still basically asking people to log a bug in one database every time they logged a bug in another database. We needed something more active and engaging.

Welcome to the What The Bug!? Show

At the time we already had a weekly all-hands meeting every Friday afternoon where anybody could contribute a segment to demo something, talk about something interesting, or anything else they wanted. I was doing short segments on quality and testing topics every few weeks to try to promote testing in general, but it was largely a one-man show. A fellow tester/developer that I was working with on the What The Bug!? knowledge base had the idea to take just a couple minutes each week to present our favourite bugs of the week.

Turning a passive library of bugs into a weekly road show was a big success. We were basically getting the benefit of a bug-of-the-week newsletter with the added bonus of an already established live audience. Again, the branding helped to sell it, and because the two of us both embraced the idea that any production bug could be turned into a 2-minute elevator pitch for something interesting to learn, we were able to make pretty fun presentations. We at least got laughs and had people thinking about testing if only for 5 minutes, but the real highlight for me was that afterwards I had people come up to me and say “you know, I found something last week that would make a good segment next time.”

This was possible in part because she and I had both spent that time soliciting user-friendly descriptions from people logging bugs, so we knew what had been going on across all the teams for the last couple weeks. We had a lot to choose from for the feature. It would have been harder to do without that knowledge base, being limited to only the bugs we knew from the teams we worked directly with. I suspect, though, that once the weekly What The Bug!? segments built up enough momentum that people took on presenting their own bugs, the need for the email prompts and the knowledge base would fade away. An archive of the featured bugs could still be a useful resource for new people coming on board, but it would no longer be the primary driver.

Where are things now?

I left the company shortly after starting What The Bug!?, but I recently had a chance to check in with an old colleague and inquire about where these two manifestations of the idea ended up. Unsurprisingly in retrospect, the knowledge base has largely fizzled. The combination of asking people to volunteer extra writing work and the low ROI of a poorly written archive doomed it pretty early on. On the flip side, bugs are still a regular topic at those weekly meetings, although I’m sorry to report that they dropped the What The Bug!? branding. If anybody else wants to try something similar, feel free to use the name; all I ask for is a hat tip. (A quick google of the phrase only turns up some etymologists and an edible insect company, but if someone else had the same idea in the context of software, do let me know and I’ll pass the hat tip on.)

Tests vs Checks and the complication of AI

There’s a lot to be made in testing literature of the differences between tests and checks. This seems to have been championed primarily by Michael Bolton and James Bach (with additional background information here), though it has not been without debate. I don’t have anything against the distinction such as it is, but I do think it’s interesting to look at whether it’s really a robust one.

The definitions, just as a starting point, are given in their most recent iteration as:

Testing is the process of evaluating a product by learning about it through exploration and experimentation, which includes to some degree: questioning, study, modeling, observation, inference, etc.

Checking is the process of making evaluations by applying algorithmic decision rules to specific observations of a product.

These come with a host of caveats and clarifications; do go back and read the full explanation if you haven’t already. The difference does not seem intuitive from the words themselves, which may be why there is such debate. Indeed, I’ve never seen anybody other than testing professionals make the distinction, so in normal usage I almost exclusively hear “test” used, and never “check”. Something I might call an automated test, others might call—and insist that it be called—an automated (or machine) check. This is just a consequence of working day-to-day with developers, not with testing nerds who might care about the difference.
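To make the "checking" definition a bit more tangible, here is a small sketch of what I would take it to describe: a fixed decision rule applied to one specific observation of a product. The `Observation` structure, the login-page rule, and the sample values are all invented for illustration. They are my reading of the definition, not an example taken from James Bach's or Michael Bolton's writing.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """One specific observation of the product (here, an imagined HTTP response)."""
    status_code: int
    body: str


def check_login_page(observation: Observation) -> bool:
    """An algorithmic decision rule: it returns pass or fail and nothing more."""
    return observation.status_code == 200 and "Sign in" in observation.body


# Hypothetical observations, as if captured from two builds of an app.
good_build = Observation(status_code=200, body="<h1>Sign in</h1>")
bad_build = Observation(status_code=200, body="<h1>Sing in</h1>")

print(check_login_page(good_build))
print(check_login_page(bad_build))
```

Under the definitions quoted above, choosing to look at the login page in the first place, and working out what a failure actually means, would fall on the "testing" side of the line. The function itself knows nothing about either.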

Along those lines, I also find it interesting that this statement, still quoting from James Bach’s blog:

One common problem in our industry is that checking is confused with testing. Our purpose here is to reduce that confusion.

goes by with little explanation. There is a clear desire to differentiate what a human can do and what a computer can do. The analogy in the preamble to craftspeople being replaced by factory workers tries to illustrate the problem, but I’m not sure it really works. The factory model also has advantages and requires its own, different set of skilled workers. I may just be lucky in that I haven’t ever worked in an environment where I was under pressure to blindly automate everything and dismiss the value humans bring to the process, so I’ve never needed the linguistic backing to argue against that. This affords me the privilege of wondering whether this distinction has come about only because of a desire to differentiate between what a computer and a human can do, or because there actually is a fundamental difference.

Obviously, as far as what is possible today, there is no argument. But the more we see AI coming into use in testing, the more difficult this distinction will become. If I have an AI that knows how to use different kinds of apps, and I can give it an app without giving it any specific instructions, what is it doing? Can it ever be testing, or is it by definition only checking? There are AI products being pushed by vendors now that can report differences between builds of an app, though for now these don’t seem to be much more than a glorified diff tool or monitor for new error messages.

Nonetheless, it’s easy to imagine more and more advanced AIs that can better and better mimic what a real end user (or millions of end users) might do and how they would react. Maybe it can recognize UI elements or simulate the kinds of swipe gestures people make on a phone. Think of the sort of thing I, as a human user, might do when exploring a new app: click everywhere, swipe different ways, move things around, try changing all the settings to see what happens, etc. It’s all that exploration, experimentation, and observation that’s under the “testing” definition above, with some mental model of what I expect from each kind of interaction. I don’t think there’s anything there that an AI fundamentally can’t do, but even then, there would be some kind of report coming out the other end about what the AI found that would have to be evaluated and acted upon by a human. Is the act of running the AI itself the test, and everything else it does just checks? If you’re the type that wants to say that “testing” by its nature can’t be automated, then do you just move the definition of testing to mean interpreting and acting on the results?

This passage addresses something along those lines, and seems to answer “yes”:

This is exactly parallel with the long established convention of distinguishing between “programming” and “compiling.” Programming is what human programmers do. Compiling is what a particular tool does for the programmer, even though what a compiler does might appear to be, technically, exactly what programmers do. Come to think of it, no one speaks of automated programming or manual programming. There is programming, and there is lots of other stuff done by tools. Once a tool is created to do that stuff, it is never called programming again.

Unfortunately “compiling” and “programming” are both a distracting choice of words for me (few of the tools I use involve compiling and the actual programming is the least interesting and least important step in producing software). More charitably, perhaps as more and more coding becomes automated (and it is), “programming” as used here becomes the act of deciding how to use those tools to get to some end result. When thinking about how the application of AI might confuse “tests” vs “checks”, this passage stuck out because it reminded me of another idea I’ve heard which I can only paraphrase: “It’s only called machine learning (or AI) until it works, then it’s just software”. Unfortunately I do not recall who said that or if I am even remembering the point correctly.

More to the point, James also notes in the comments:

Testing involves creative interpretation and analysis, and checking does not

This too seems to be a position that, as AI becomes more advanced and encroaches on areas previously thought to be exclusive to human thought, will be difficult to hold on to. Again, I’m not making the argument that an AI can replace a good tester any time soon, but I do think that sufficiently advanced tools will continue to do more and more of what we previously thought was not possible. Maybe the bar will be so high that expert tester AIs are never in high enough demand to be developed, but could we one day get to the point where the main responsibility a human “tester” has is checking the recommendations of tester AIs?

I think more likely the addition of real AIs to testing just means less checking that things work, and more focus on testing whether they actually do the right thing. Until AIs can predict what customers or users want better than the users themselves, we humans should still have plenty to do, but that distinction is a different one than just “test” vs “check”.

Yes, I test in production

Recently a post on Reddit about a company doing tests on a live production environment sparked some conversation on the Slack channel about whether “testing in production” is a wise idea or not. One Reddit user commented saying:

Rule number 1 of testing (i.m.o): DO IT ON A NON-PRODUCTION ENVIRONMENT.
Rule number 2 of testing (i.m.o): Make sure you are NOT, I repeat, NOT on a production environment.

Three years ago I might have agreed with that. Today, I absolutely don’t. Am I crazy?

never test in production… for some definitions of production

The original post describes a situation where some medical equipment stopped working overnight. After much debugging and technical support, the cause was identified to be that the machines were remotely put into a special mode for testing by the vendor and not restored before the morning.

There’s unlikely to be anything controversial in saying that this wasn’t a good move on the vendor’s part. They were messing with something a customer was, or could have been, using, without notifying them about it. Though it’s all the more egregious because of being medical equipment, any customer would be annoyed by this when they found out. But you can’t extend that in a blanket way to all kinds of production environments.

There is certainly a lesson to be learned here, but we will get more from it by being more specific. One might suggest any of:

  • Never test something your customer is already using
  • Never test in a way your customer will notice
  • Never test something your customer will notice without telling them

But I can think of counter-examples to each of these, and it boils down to a very simple observation.

If you never test x, you will miss things about x

If you never test in production, you’re robbing yourself of insights about production. You won’t necessarily miss bugs, but unless you have a test environment that mimics every aspect of production perfectly (and none of us do), there will be something that goes on in production that you won’t see.

This is what I didn’t understand four years ago when I started in this line of work. In my first testing job, we didn’t test in production, so why would we test in production? The naïveté of a newbie tester, eh?

Over time we got bitten by this in many different ways. The most common category of issues was use cases happening in production that we simply didn’t anticipate. Of course, some of these you might find earlier by using production data in a test environment, but you will always be limited by your sample. A second category of issues would then arise around faulty assumptions in your test environment. It’s those little details that “shouldn’t affect testing” until they do. If you’re lucky you spot an error right away. If you’re less lucky you push a feature that quietly does nothing until somebody notices it isn’t there. If you’re really having a bad day it silently does the wrong thing.

It’s around this time that you start to catch on to the fact that you need to test new deployments, at the very least to verify that something is working “in the real world”. At this point you’re testing, to some degree, in production. Are we as far along as turning off medical equipment a customer is using? No, but we are already bending the “never test in production” rule.

This is just the beginning of a long list of reasons; iAmALittleTester has compiled a list of many similar scenarios where testing in production can provide information that you aren’t getting otherwise. All of these, though, count on the fact that you aren’t going to break anything by testing. (Maybe this is the crux: if you think that the job of a software tester is to break the software, then you likely think “testing in production” is synonymous with “breaking production”!)

What can you do safely?

One of the key differences between testing in a web or back-end server environment and hardware owned by a customer is that I can usually send requests to a server and examine the response without impacting any other requests hitting the same server. I probably can’t do that with customer hardware. The parallel would be that you probably shouldn’t use a customer’s account for your tests. You need to be careful about any state or statistics a system might be keeping track of. (Though I would raise an eyebrow at anything that needs to be aware that it is being tested. If you have “test mode” in your code, you’ve just created a mini test environment in a live program. Are you gaining any advantages from being in production in that case?)

Whether this works, and the types of things you can do, will depend on the issues you expect to find. I’m not going to try stress testing my production environment during peak traffic. If I suspect that a certain kind of request will corrupt the state of the server, I’m certainly not going to do that in production either. If my test has any chance of having a negative effect on a user, I’m not going to risk that. But on a web app, one more anonymous request should be no different from what “real” users are sending the app. And on the subject of “real” users…
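As a rough illustration of that "one more anonymous request", here is a sketch of a read-only check against a live web service. It is written under several assumptions: the URL in the usage note is a placeholder, treating GET and HEAD as side-effect free is something you would have to confirm for your own service, and the `opener` parameter exists only so the function can be exercised without touching a real server.

```python
import urllib.error
import urllib.request

# Assumption: on this service, these methods only read and never change state.
READ_ONLY_METHODS = {"GET", "HEAD"}


def production_smoke_check(url, method="GET", timeout=5,
                           opener=urllib.request.urlopen):
    """Send one anonymous, read-only request and report whether it returned 200.

    Refuses to send anything that is not on the read-only list, so the check
    cannot be repurposed by accident into something that modifies production.
    """
    if method not in READ_ONLY_METHODS:
        raise ValueError(f"{method} is not on the read-only list; not sending it")

    request = urllib.request.Request(url, method=method)
    try:
        with opener(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # urlopen raises for 4xx/5xx responses; for this check that is simply a failure.
        return False
```

Calling `production_smoke_check("https://example.invalid/health")` would be the whole test, with the placeholder swapped for a real endpoint. There are no customer accounts, no logins, and no load involved, which is what keeps it on the safe side of the line drawn above.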

I’m not just a QA, I’m also a user

This is the aspect that I’ve found most useful about testing a production environment. Much like experience is often best gained by doing, knowledge of how a product works may be best gained by using it. If you only use a product in a test environment, then you only know about how your product works in a test environment. There are lots of insights about a product that can’t come from simply using it, of course, and in some cases it isn’t realistic to expect to be able to use a product as much as the intended users. But if it is possible, if you can make it possible, then it is an opportunity to see things in a different way.

When I do get to use something I’m working on like an end-user does, on some level I’m always testing it. It’s not a big leap to say that your end users are, on some level, testing your product every time they use it. If your users are testing in production, why aren’t you?

Addendum: Rosie Sherry pulled together some resources on testing in production over at the MoT Club that are definitely worth checking out. Chaos Monkey from Netflix is one really interesting way of testing your production setup that I meant to mention in this post, but was only reminded of again when I came across this thread.