The Hidden Power of Testing

Not long ago, when programming was just a hobby, I thought there was nothing more to testing code than:

1. writing some code
2. checking that it compiles
3. running it
4. checking that it did what you expected it to do.

And if steps 2–4 failed, go back to 1 and try again.

With a little more experience in programming, and having had hands-on experience working on various projects with other actual human beings, I have a greater appreciation for testing and the role it plays in development. For the Waterbear team, I was responsible for setting up our unit tests and our continuous integration environment. I also designed acceptance tests for developing the debugging feature.

This post is for the beginner or the intermediate programmer. Testing, whether it’s unit tests or acceptance tests, is not only a way of making sure your code works; it’s also a way to discover how to design software.

And because I cannot imagine any topic quite as dry as “Eddie extols the virtues of testing” I’ll be filling this post up with tangentially related animated GIFs.

Meowton's Cradle

So let’s get started!

But Eddie! What is testing?

The word “testing” itself is pretty intuitive. You have a thing. You check if it works. Boom. Testing! Let’s go for a celebratory pint!

Testing of the Dead

But what you’re really doing is ensuring the quality of the program. Testing is not only checking that the code works, but that it works well, that the software behaves as expected, and (this one takes a while to learn) that the code is resilient to future modifications. Resilient, like the extremophiles of the phylum Tardigrada, the hardy waterbears!

Waterbear in love... or space, I guess.

What I’ve come to find is that this process of defining, and having a clear vision of, what it means for your code to be right gives you a clear vision of how to write your code right. It might just be that when you take on a project, you have nowhere to start. The task of programming may seem daunting, and it is more so when there’s no clear place to begin. In this case… may I suggest testing?

Despite the abundance of methods, methodologies, and testing fanboys (yes, even more than myself), there is no one way to do testing The Right Way™. The right way is the way that benefits you, the programmer. That said, testing does come in many forms, and there are a lot of buzzwords and funny language that people use when talking about testing. I’m going to try to clear up two major concepts: unit testing and acceptance testing.

Unit testing!

This is probably what most people refer to when they’re talking about “the tests”. It might be that you think unit testing is the only kind of testing that is considered to be Real Testing™ (spoilers: nope).

Unit tests are all about:

breaking down your program into little tiny, self-contained pieces (the so-called units),
and making sure those itty-bitty pieces work well on their own.

The astute reader may notice something: this assumes that the problem can be nicely broken down into little tiny, self-contained pieces. For a number of reasons, this is not always an easy thing to accomplish, and the benefits of breaking down your program into testable units may not fully justify the effort. For example, if your program needs to deal with external systems, you have to emulate the external system and all of its possible conditions, which is not unheard of, but certainly takes considerable work. Nevertheless, unit testing is a must-have in your testing arsenal.

How do you do it?

Unit testing is so prevalent that most languages have standard or de facto ways of unit testing. For example, Java has JUnit, Python has the unittest module, and for client-side JavaScript, there are many options, including QUnit. Basically, if you want a unit testing framework, look for a thing for your language that ends in “unit”. Of course, these aren’t the only options, but without putting in much effort, you’ve probably got something that will easily integrate with your system.
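To give a flavour of what such a framework looks like, here is a minimal sketch using Python’s unittest module. The function is_even and the test class are made up for illustration; they are not from any real project.

```python
import unittest


def is_even(n):
    """Hypothetical unit under test: True when n is divisible by 2."""
    return n % 2 == 0


class IsEvenTest(unittest.TestCase):
    """Each test method checks one small, self-contained expectation."""

    def test_even_number(self):
        self.assertTrue(is_even(4))

    def test_odd_number(self):
        self.assertFalse(is_even(7))

    def test_zero_counts_as_even(self):
        self.assertTrue(is_even(0))


def run_tests():
    """Load the test case and run it, returning the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(IsEvenTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```

In a real project you would more likely let the framework discover the tests for you, for example with python -m unittest.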

But that misses the point. The most important part of unit testing is identifying the units. You gotta find the smallest pieces of whole cohesive units of code and only then can you assert that these units work. As long as you know the small parts work, then you’ll have the confidence to assemble them into a larger system. But, er… how do you identify the units?

Looking for units

Maybe the better option is to figure out what’s easy and obvious to test before you write the code for it. Then, it becomes clear what you need to code, and what is a good chunk of code to call a “unit”.

Take this actual example of actual Python programming that I actually wrote. For this script, I wanted to do a syntax check on Python files that it downloaded. So I started defining a function along with its docstring. Docstrings, for those uninformed, are a super nifty language feature in Python. Wanna document a function, class, or module? Then the first statement of said function, class, or module should be a string literal, which will automatically become the __doc__ special attribute of that object. This is super rad, you guys! But it gets even radder.

def syntax_ok(contents):
    """
    Given a source file, returns True if the file compiles.
    """

Okay, so we’ve got the purpose down. That’s one step of the process of understanding what task you have to accomplish as a programmer. But how will you know that this works? Enter the doctest: beginning Python programmers are familiar with the interactive console where they can try lines of code and see their result. The prompt for the interactive console is three right-facing arrows: >>> . Say I’m testing the completed syntax_ok() function. A sample session in the interactive console may look like this:

>>> syntax_ok('print("Hello, World!")')
True
>>> syntax_ok('import java.util.*;')
False
>>> syntax_ok('\x89PNG\x0D\x0A\x1A\x0A\x00\x00\x00\x0D')
False
>>> syntax_ok(r"AWESOME_CHAR_ESCAPE = '\x0G'")
False

So, why not put what you expect to happen on the console into your documentation and call it a test? That’s exactly what a doctest is. Now we know what we want to accomplish both in human terms and in computer terms (in this case, the return value of this function). Additionally, since we embedded the test in the documentation, we have precise documentation for how to use the function in the future. And if it passes the tests, then we know our documentation is correct. I told you it got even raddererer! Putting it together, it looks like this:

def syntax_ok(contents):
    r"""
    Given a source file, returns True if the file compiles.

    >>> syntax_ok('print("Hello, World!")')
    True
    >>> syntax_ok('import java.util.*;')
    False
    >>> syntax_ok('\x89PNG\x0D\x0A\x1A\x0A\x00\x00\x00\x0D')
    False
    >>> syntax_ok(r"AWESOME_CHAR_ESCAPE = '\x0G'")
    False
    """

At this stage, I knew what I wanted, but didn’t know how to accomplish it, so I just looked up appropriate documentation. But the important part was that I now knew where to start. I implemented it thusly:

    try:
        compile(contents, '<unknown>', 'exec')
        # why does compile throw so many generic exceptions...? >.<
    except (SyntaxError, TypeError, ValueError):
        return False
    return True

There we go! As soon as I implemented it, I had working tests that I could run like this¹:

python -m doctest ghdwn.py

In standard, counterintuitive Unix fashion, no output means that all of my tests passed!
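If the silence is unnerving, the same doctests can also be run from inside Python and the results inspected. The following is a rough sketch using the standard doctest module; the helper name run_doctests is made up for illustration and is not part of the original script.

```python
import doctest


def syntax_ok(contents):
    r"""
    Given a source file, returns True if the file compiles.

    >>> syntax_ok('print("Hello, World!")')
    True
    >>> syntax_ok('import java.util.*;')
    False
    >>> syntax_ok('\x89PNG\x0D\x0A\x1A\x0A\x00\x00\x00\x0D')
    False
    >>> syntax_ok(r"AWESOME_CHAR_ESCAPE = '\x0G'")
    False
    """
    try:
        compile(contents, '<unknown>', 'exec')
    except (SyntaxError, TypeError, ValueError):
        return False
    return True


def run_doctests(func):
    """Hypothetical helper: run the doctests in func's docstring.

    Returns a result with `failed` and `attempted` counts.
    """
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # Make the function visible by name to the examples in its own docstring.
    for test in finder.find(func, globs={func.__name__: func}):
        runner.run(test)
    return runner.summarize(verbose=False)


if __name__ == "__main__":
    print(run_doctests(syntax_ok))
```

Passing -v to the command-line runner (python -m doctest -v ghdwn.py) is another option: it prints each example as it runs.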

But lo! As syntax_ok() was being called in a long-running script that checked the syntax of many, many, many Python files, an enormous flaw soon became apparent. After a while, my script would crash with a MemoryError, indicating that my program had somehow run out of memory.

Pictured: me encountering a memory leak

Evidently, calling compile() cached the results of compiling code that I never used, since I merely called compile() for its side effect of reporting whether a file contained syntax errors. As a result, I had to fix this dreaded memory leak to stop it from randomly crashing my long-running script.

This is where having unit tests really pays off. The inevitable occurred: I had to modify my code and had to make sure that it still did the same thing. Luckily, I could assert that my code behaved the same since I had basic tests in place. Now all I had to do was figure out how to patch that memory leak. It occurred to me that I can’t have a memory leak in a process that exits immediately, so using some Unix voodoo, I fork’d my process into a parent (the process expecting an answer) and a child (the process that would compile the code, cache the result, and promptly exit, destroying along with it the cached results of compilation). My finished product looked like this. Note the doctest within the documentation:

import os


def syntax_ok(contents):
    r"""
    Given a source file, returns True if the file compiles.

    >>> syntax_ok('print("Hello, World!")')
    True
    >>> syntax_ok('import java.util.*;')
    False
    >>> syntax_ok('\x89PNG\x0D\x0A\x1A\x0A\x00\x00\x00\x0D')
    False
    >>> syntax_ok(r"AWESOME_CHAR_ESCAPE = '\x0G'")
    False
    """

    pid = os.fork()
    if pid == 0:
        # Child process. Let it crash!!!
        try:
            compile(contents, '<unknown>', 'exec')
        except:
            # Use _exit so it doesn't raise a SystemExit exception.
            os._exit(-1)
        else:
            os._exit(0)
    else:
        # Parent process.
        child_pid, status = os.waitpid(pid, 0)
        return status == 0
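The parent only ever looks at the child’s exit status, so it is worth seeing what actually comes back from os.waitpid(). Below is a small sketch, assuming a Unix-like system where os.fork() is available; the helper exit_code_of_child is hypothetical and exists only to show how the status is decoded.

```python
import os


def exit_code_of_child(code):
    """Fork a child that exits at once with `code`; return what the parent sees."""
    pid = os.fork()
    if pid == 0:
        # Child process: exit immediately, skipping Python's normal cleanup.
        os._exit(code)
    # Parent process: block until the child is done.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        # Normal exit: extract the 8-bit exit code from the raw status.
        return os.WEXITSTATUS(status)
    # The child was killed by a signal rather than exiting normally.
    return None
```

A clean exit gives a raw status of 0, which is why comparing status == 0 is enough here. Exit codes are truncated to eight bits, so a child that calls os._exit(-1) is seen by the parent as having exited with 255.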

In this way, my simple unit test helped me:

Figure out what task I needed to accomplish. This became my “unit”.
Determine what would be the correct output of said unit.
Document how to use my function.
When it came time to change my function, ensure that its behaviour would stay the same.

Tubular

How we use unit tests in Waterbear

In Waterbear, we use unit testing to ensure that underlying block implementations (the runtime functions) return the proper results. Since this is the code that ultimately runs when a block is used in a Waterbear program, we must ensure that the behaviour remains consistent as development progresses. For this, we created a QUnit test suite, which can be run in a browser, such as Chrome or Firefox. In addition to testing in a browser, we can also run it in a headless browser like PhantomJS. This allows us to run tests on the command line, and even on a remote server every time we update some code.

Enter continuous integration. Whenever we push code to GitHub, a worker on Travis CI clones a fresh copy of the new code and runs our unit tests. Whenever an update fails any of the tests, the team gets a notification. This lets us know that the update is not quite ready yet, and allows us to make sure the fresh code meets our standards before we pull it into the working copy we share with our users.

One of my first tasks on Waterbear was setting up Travis CI for our unit tests. Much like a Pokémon trainer in Kanto, any open source project worth its salt is nothing without a veritable boatload of build badges. Obviously, my most important contribution to Waterbear was to place the much-coveted build badge on our README.

All in a day's work

It was a hard job, but I managed to pull it off.

A word on methodologies

Some people call what I did in my Python example test-driven development (TDD) or, if you really want to be pedantic, test-first development. Either way, your code lives to serve the test. Under this framework, your code’s only purpose is to ensure that the tests pass. Some people are really adamant about this process, and assert that the only way to know your code will end up working properly is if you write the tests first. I… remain skeptical. It’s certainly a nifty technique, and one I use often, but it’s not the only way to do things. Another method is acceptance testing.

Huzzah! Acceptance Testing!

Acceptance testing is simply defining the behaviour we expect, and under what circumstances we should say that a thing fulfils its duty. Wait… this is sounding familiar… didn’t we just talk about this? Well, kind of. While unit testing focuses on the smaller parts, acceptance testing is much more high level, and is often done by a living, breathing human rather than run automatically by a continuous integration robot.

Testing doesn’t have to be automatic or focused on specific, small units of code. Definitely, it’s nice when we can test individual units of code—plus, it’s generally indicative of a cleaner, more modular design—but this doesn’t necessarily say anything about the ultimate awesomeness of the software. Besides, sometimes it’s just straight up difficult to break the code up perfectly in this way. It happens.

Golly

Don’t beat yourself up for it. There are a lot of design decisions to make and you’re never gonna make everything perfect; a 1-1 correspondence between code unit and function is not the ultimate goal of programming: it’s making a system that works! …uh. Whatever that means.

The point of acceptance testing is to define scenarios that capture the requirements of your code, i.e., what your code must ultimately achieve to be considered “good”. You define any conditions that must be set up prior to the test, the tangible steps that a person has to undertake to make the scenario happen, and the criteria for saying “yep, this sure did work”. It’s like a checklist: you check off all of the steps, and at the end, you know that the code works.

This is a relatively new concept for me, so I don’t apply any serious formality to it. I did find myself using this in Waterbear recently to determine whether I was writing the right thing for the Waterbear debugger. Before I started writing any substantial amount of code for the debugger, I had no idea where to start. I legitimately struggled for a while with a file in my editor that just read // debugger. It was… embarrassing.

But a chance encounter with my UCOSP supervisor, Eleni Stroulia, reminded me about acceptance testing—something I had only ever practiced once. So I got to it! I looked at the informal list of requirements that we collected on our issue tracker. Then I edited these initial feature requests into a feature list that was a bit more fleshed out. After this, I got started writing the tests!

The template I followed contained the following sections:

Setup: How to get the system into a state necessary to begin the test. For the Waterbear debugger, most of these came along with an example script that would demonstrate the desired phenomenon.
Preconditions: Any special state that the system must be in prior to the test.
Test: The steps necessary that a user would do to accomplish the given task.
Acceptance criteria: This is the checklist: the list of things that should happen.
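Since these tests are run by a person, plain text is all you really need. Still, as a sketch of the template’s shape, here is one way it could be written down as a small Python data structure that prints itself as a checklist. The class, its field names, and the example scenario are all hypothetical; they are not taken from the actual Waterbear test scripts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AcceptanceTest:
    """One scenario, following the four sections of the template."""
    title: str
    setup: str
    preconditions: List[str]
    steps: List[str]
    criteria: List[str]

    def as_checklist(self) -> str:
        """Render the scenario as plain text that a human tester can follow."""
        lines = [self.title, "Setup: " + self.setup, "Preconditions:"]
        lines += ["  - " + item for item in self.preconditions]
        lines.append("Test:")
        lines += ["  {}. {}".format(n, step) for n, step in enumerate(self.steps, 1)]
        lines.append("Acceptance criteria:")
        lines += ["  [ ] " + item for item in self.criteria]
        return "\n".join(lines)


# A made-up scenario, purely for illustration.
example = AcceptanceTest(
    title="Pausing a running script",
    setup="Load an example script that loops forever.",
    preconditions=["The script is not yet running."],
    steps=["Start the script.", "Press the pause button."],
    criteria=["Execution stops.", "The current step is highlighted."],
)
print(example.as_checklist())
```

Each acceptance criterion comes out as an empty checkbox, so the tester can tick items off one by one.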

Of course, as this type of testing is usually performed by a human and not automatically by a computer, care should be taken in sucking out any subjectivity, vagueness, and ambiguity in the script.

Once everything has been defined, go do it! In my case, writing even half of the tests helped me think of the tasks that I had to accomplish and gave me a good idea of the architecture I had to build in order to reach that goal. After writing enough code to fulfil even one of these tasks, I’d go test it. And out of the process, I got this illuminating state diagram that displays all of the ways that a user can plausibly go through execution states. It turned out to be way more involved than I expected.

State diagrams get me excited, okay?

(Apologies for the criminal unfunniness of the last still image.)

The end result should be a list that is clear to follow in order to check that the debugger is working properly. And hey! We now have a clear (or rather, more clear) definition of what it means when we say “the debugger is working properly.”

The process of writing the acceptance tests gave me a fresh look at the problem I had, and allowed me to think about it and visualize it in different ways; it allowed me to organize the complexity of the task. And for that reason alone, I’d recommend writing an acceptance test when you have a large task and no idea how to tackle it.

Conclusion

In the end, testing may seem like a mindless process, something that hoity-toity software engineering types (of which I definitely am one) are always goading other coders into doing. But the fact is, beyond the obvious motivation of “checking that it works right”, testing also yields a method for discovering how to solve a problem. And I think that’s pretty neat.

oh yeah

—Eddie

¹ Bonus PROTIP: I like to automatically run my doctests whenever I save my work. I use pytest with the xdist plugin. Install them like so:
```
pip install pytest pytest-xdist
```
Then to start running tests continuously, I open a new terminal and type the following in the same directory as the file I’m working on:
```
pytest -f --doctest-modules
```
Alternatively, to save myself a bit of typing, I put this in my .aliases file (if you don’t know what this is, you probably want to put this in your ~/.bash_profile):
```
alias doctest='python -m doctest'
```
Which allows me to run the doctests of any file by simply:
```
doctest FILE.py
```

Written on April 4, 2015