Little Alchemy: A silly little test exercise

Lately I’ve been entertaining myself with a silly little mobile app called Little Alchemy 2, a simple revamp of an earlier concept game. It distracted me enough this week that I didn’t prepare a respectable blog post. While questioning my life choices, I started to convince myself that maybe I had been playing at testing this whole time. Testers, myself included, love to make analogies to science, and this is a game that simulates the precursor to modern chemistry. Of course it can be used as a tortured metaphor for software testing! Follow me on my journey…

Exploration

Screenshot of Little Alchemy

The main concept of the game is to drag icons of real-life objects on top of each other, combining them to make new items.

At first there are only a few options: Water, Earth, Fire, Air. The classics. Not hard to just play around with different combinations and see how this game works.

Fire + Water = Steam. Makes sense.

Water + Earth = Mud. Great.

Now I wonder what happens if I combine steam and mud….

After a while you start to get a sense for how this universe works.

Developing heuristics

After a while I start to see patterns. If I combine the same thing with itself, I get a bigger version of that thing; can I do that with everything? Fire seems to transform most items. When I find out how to make paper or wood, I’m probably going to try to throw those on the fire too.

Combinations quickly multiply.

Before long I have hundreds of items of all sorts. Too many to keep in my head. I start to think I’m probably missing combinations, or I’ve forgotten what combinations I’ve already tried. I know I tried putting wood into fire, but did I put wood into the campfire? Am I missing some obvious combination that’s right in front of me?
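
In hindsight, even the simplest bookkeeping would have answered that last question. Something like the toy sketch below is all it takes; it is purely illustrative, and the item names are just ones from my own playthrough rather than anything the game exposes.

```python
# A toy sketch of the bookkeeping I wish I'd been doing: record every pair
# I've tried so I can later ask "have I combined these two yet?"
tried: set[frozenset[str]] = set()

def record_attempt(a: str, b: str) -> None:
    """Remember that a and b were combined (order doesn't matter)."""
    tried.add(frozenset((a, b)))

def already_tried(a: str, b: str) -> bool:
    return frozenset((a, b)) in tried

record_attempt("wood", "fire")
print(already_tried("fire", "wood"))      # True
print(already_tried("wood", "campfire"))  # False -- the gap I kept forgetting
```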

Using an Oracle

That’s when I start googling around for hints and suggestions. This part gets a bit cheaty, but at the time it was what I needed to do to get a handle on the problem and keep making progress. I found a site that would randomly show me an item, and I could use that to see if I had the pieces I needed to make it. No more guessing: I was given the right answer.

I suppose this is where I could veer off into automation, but where’s the fun in that? After a while, I started to exhaust the hint system anyway; with only random items the ratio of new to known items started to decline. ROI went down. My oracle was getting less useful, not telling me anything new.

Brute force

There was still an intractable number of items, and I had seen enough unexpected combinations that I didn’t trust myself to reason them all out. So instead, I turned to brute force.

First item on the list. Try to combine with every item below it. Repeat.

Now I should really think about automation if my goal were just to find all the combinations. This is a pure algorithm with finite steps, since the game promises a finite number of items. But things start to change after a few manual iterations. Happily, the game removes items that have no undiscovered combinations left, so in theory the complexity will peak and the game will start to get simpler again. (Wouldn’t it be nice if software always got simpler with time?) Moreover, a little bit of brute force starts to show patterns that I hadn’t been aware of before. I start to skip ahead: “aha! if that combination works, then I bet this other one will too…” One new strategy begets another!
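
For what it’s worth, if I did hand this over to automation, the core loop is easy enough to sketch. What follows is only an illustration of the strategy, not the game’s real interface: the combine function is a stand-in for whatever would actually drive the app, and it assumes each pair yields at most one new item.

```python
from itertools import combinations_with_replacement

def brute_force(start_items, combine):
    """Sketch of the brute-force strategy: take every pair of known items
    (including an item with itself), try to combine them, and repeat the
    sweep until nothing new appears.

    combine(a, b) is a stand-in for actually driving the game; it should
    return the new item, or None if the pair makes nothing.
    """
    discovered = set(start_items)
    changed = True
    while changed:
        changed = False
        for a, b in combinations_with_replacement(sorted(discovered), 2):
            result = combine(a, b)
            if result and result not in discovered:
                discovered.add(result)
                changed = True
    return discovered

# Toy example with made-up recipes, just to show the loop terminating:
recipes = {frozenset(["water", "fire"]): "steam",
           frozenset(["water", "earth"]): "mud"}
found = brute_force(["water", "earth", "fire", "air"],
                    lambda a, b: recipes.get(frozenset([a, b])))
print(sorted(found))  # ['air', 'earth', 'fire', 'mud', 'steam', 'water']
```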

Inference and deduction

I reach a tipping point where items are being removed from play more and more often. This feels like the home stretch, but brute-forcing it will still take ages. Often, though, an item only has one or two combinations left to discover before it’s exhausted. I use this to my advantage.

Enter another oracle of sorts; this time, it’s the documentation of everything I’ve done so far. For any item still on the board, the game tells me how many undiscovered combinations it has, which items have been used to make it, and all the other items it has been used to make so far. This is all data I can analyse to look for yet more patterns, and to spot the gaps I’ve missed so far. The rate at which I clear items off the board goes way up, and I’m still using the same manual interactions I’ve used all along.
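
To give a flavour of what that analysis looks like, here is a purely illustrative sketch; the field names are mine, not anything the game actually exposes. The idea is to rank the remaining items by how few undiscovered combinations they have left, since each of those is only a lucky guess or two away from leaving the board.

```python
from dataclasses import dataclass, field

@dataclass
class ItemStats:
    """Made-up record of the sort of data the game reports for one item."""
    name: str
    undiscovered: int                      # combinations still hidden
    made_from: list[str] = field(default_factory=list)
    used_to_make: list[str] = field(default_factory=list)

def next_targets(stats: list[ItemStats], limit: int = 5) -> list[ItemStats]:
    """The items closest to being cleared off the board, easiest first."""
    remaining = [s for s in stats if s.undiscovered > 0]
    return sorted(remaining, key=lambda s: s.undiscovered)[:limit]

board = [ItemStats("campfire", undiscovered=1, made_from=["wood", "fire"]),
         ItemStats("mud", undiscovered=4, made_from=["water", "earth"])]
print([s.name for s in next_targets(board)])  # ['campfire', 'mud']
```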

End game

I’m not there yet. Do I still have more to learn? Is another strategy going to make itself obvious before I finish the game, or will an old one resurface as the dynamics change?

And what am I going to explore next?

How do you know you actually did something?

When I was defending my thesis, one of the most memorable questions I was asked was:

“How do you know you actually did something?”

It was not only an important question about my work at the time, but has also been a very useful question for me in at least two ways. In a practical sense, it comes up as something like “did we actually fix that bug?” More foundationally, it can be as simple as “what is this test for?”

Did it actually work?

The original context was a discussion about the efforts I had gone through to remove sources of noise in my data. As I talked about in my previous post, I was using radio telescopes to measure the hydrogen distribution in the early universe. It was a very difficult measurement because, even in the most optimistic models, the signal was expected to be several orders of magnitude dimmer than everything else the telescopes could see. Not only did we have to filter out radio emission from things in the sky we weren’t interested in, but, unsurprisingly, there was also a whole lot of radio coming from Earth itself. Although we weren’t actually looking in the FM band, it was a lot like trying to pick out some faint static in the background of the local classic rock radio station.

One of the reasons these telescopes were built in rural India was that there was relatively little radio in use in the area. Some of it we couldn’t do anything about, but it turned out that a fair amount of the radio noise in the area was accidental. The most memorable example was a stray piece of wire that had somehow been caught on some high-voltage power lines and was essentially acting like an antenna, leaking power from the lines and broadcasting it across the country.

We developed a way of using the telescopes to scan the horizon for bright sources of radio signals on the ground and triangulate their locations. I actually spent a good deal of my time wandering through the countryside with a GPS device in one hand and a radio antenna in the other, trying to find these sources. This is what led to what has since become the defining photo of my graduate school experience:

Standing in a field with a radio antenna, next to a cow
Cows don’t care, they have work to do.

Getting back to the question at hand: after spending weeks wandering through fields tightening loose connections, wrapping things in radio shielding, and getting the local power company to clear stray wires off their transmission lines… did we actually fix anything? Did we reduce the noise in our data? Did we make it easier to see the hydrogen signal we were after?

Did we actually fix the bug?

In many ways, getting rid of those errant radio emitters was like removing bugs in data. Noisy little things that were there only accidentally and that we could blame for at least some of the noise in our data.

But these weren’t the bugs that you find before you release. Those are the bugs you find because you anticipated that they would come up and planned your testing around them. These, in contrast, were already out in the field. They were the kinds of bugs that come from a user report or from something weird someone notices in the logs. You don’t know what caused them, you didn’t uncover them in testing, and you’re not sure at first what is triggering them. These are the ones that are met only with a hunch: an idea that “this might have to do with X” and “we should try doing Y and that might fix it.”

But how do you actually know that you’ve fixed it? Ideally you should have a test for it, but coming up with a new test that will catch the regression and seeing it pass isn’t enough. The test needs to fail first. If it doesn’t, you can’t show that fixing the bug is what actually makes it pass. If you weren’t able to see the issue in the first place, then making a fix and still not seeing the issue tells you nothing.
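
As a concrete, if entirely invented, example of that workflow: the helper and the bug below are hypothetical, but the point stands. The new test has to fail against the broken version before a passing run after the fix means anything.

```python
# A hypothetical example: a regression test for an invented bug in a helper
# that's supposed to normalise user input. Run the new test against the
# broken version first and watch it fail; only then does a passing run
# after the fix actually tell you something.

def normalise_username(raw: str) -> str:
    # Buggy version only trimmed the left side:
    #   return raw.lstrip().lower()
    # Fixed version:
    return raw.strip().lower()

def test_normalise_username_trims_trailing_whitespace():
    # Fails against the buggy version ("alice  " != "alice"), passes after the fix.
    assert normalise_username("  Alice  ") == "alice"
```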

For us in the field, the equivalent of reproducing the bug was going out with an antenna, pointing it at what we thought was a source, and hearing static on a handheld radio. One step after fixing it (or after the power company told us they had fixed it) was to go out with the same antenna and see whether the noise had gone away or not. The next step was turning the telescopes back on and measuring the noise again; push the fix to production and see what happens.

What is this test for?

Where this can go wrong — whether you know there’s a bug there or not — is when you have a test that doesn’t actually test anything useful. The classic example is an automated test that doesn’t actually check anything, but it can just as easily be the test that checks the wrong thing, the test that doesn’t check what it claims, or even the test that doesn’t check anything different from another one.
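
Here is a made-up illustration of that first case, built around a hypothetical apply_discount helper. The first test exercises the code but never asserts anything, so it passes no matter what the function returns; the second is the version that can actually fail.

```python
def apply_discount(price: float, percent: float) -> float:
    """Invented helper, just for the sake of the example."""
    return price * (1 - percent / 100)

def test_discount_applied_but_never_checked():
    # Exercises the code, but there is no assertion: this "passes" no matter
    # what apply_discount returns, which is to say it checks nothing.
    total = apply_discount(price=100, percent=10)

def test_discount_applied():
    # A version that can actually fail, and therefore actually tests something.
    assert apply_discount(price=100, percent=10) == 90
```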

To me this is just like asking “did you actually do something?”, because running tests that don’t check anything useful doesn’t do anything useful. If your tests don’t fail when there are bugs, then your tests don’t work.

In a world where time is limited and context is king, whether you can articulate what a test is for can be a useful heuristic for deciding whether or not something is worth doing. It’s much trickier than knowing whether you fixed a specific bug, though. We could go out into the field and hear clearly on the radio that a noise source had been fixed, but it was much harder to answer whether it paid off for our research project overall. Was it worth investing more time into it or not?

How do you know whether a test is worth doing?