Dr. David J. Pearce

Automated Testing for Whiley


Recently, the online editor for Whiley was updated with some new features. Actually, the update represents a complete rewrite of the front-end in Whiley. Obviously, I am very excited about that! Previously it was written using raw (i.e. ugly) JavaScript, but now uses a framework for Functional Reactive Programming (called Web.wy). That has made a huge difference to how the code looks. Still, I’m not going to talk about that here. Rather, it is the new check feature (highlighted in red below) that I’m interested in:

Illustrating a screenshot of whileylabs.com with the check option highlighted.

What is this new “check” feature then? In a nutshell, it automatically generates test inputs for functions and runs them through the code looking for problems (e.g. divide-by-zero, index-out-of-bounds, etc). In many ways, it is similar to the QuickCheck line of tools (with the added benefit that Whiley has first-class specifications). We can think of it as a “half-way” step towards formal verification. The key is that it is easier to check a program than it is to statically verify it (more on this below). Some might think that having a check feature like this doesn’t make sense when you also have static verification. But, I prefer to think of them as complementary. Thinking about a developer’s workflow, we might imagine checking a function first as we develop it (since this is easier and quicker) before, finally, attempting to statically verify it (since this is harder and may force the specification to be further refined).

Example

As a simple example to illustrate how it works, consider the following signature for the max(int[]) function:

function max(int[] items) -> (int r)
// items cannot be empty
requires |items| > 0
// result not smaller than any in items
ensures all { i in 0..|items| | r >= items[i] }
// result is element in items
ensures some { i in 0..|items| | r == items[i] }:
   ...

We’re not really concerned with the implementation here (and we can assume it might have bugs). To check this function, the tool will do the following:

  1. (Generate) The tool generates some number of valid inputs. That is, input values which meet the precondition. For our example above, that is any int[] array containing at least one element.

  2. (Execute). The function in question is then executed using each of the generated inputs. This may lead to obvious failures (e.g. out-of-bounds or divide-by-zero errors), in which case we’ve already found some bugs!

  3. (Check). For any execution which completed, the tool then checks the result against the postcondition. If the postcondition doesn’t hold then, again, this indicates a bug somewhere.

The key is that the specification acts as the test oracle. In other words, having written the specification we get testing for free! And, the tool takes care of the difficult stuff so we don’t have to. For example, generating valid inputs efficiently is harder than it looks (more on this below).

Technical Stuff

The main technical challenge is that of efficiently generating input values under constraints. To start with, we can generate raw input values for the various data types in Whiley:

Given the ability to generate arbitrary values for a type, the more difficult question is how to generate values which meet some constraint. For example, consider this data type:

type List<T> is {
     int length,
     T[] items
}
where 0 <= length
where length <= |items|

The current approach we take is rather simplistic here. We simply generate all values of the underlying type and discard those which don’t satisfy the invariant. This works, but it has some problems:

  1. (Scale). Enumerating all values of a complex data type upto a given bound can be expensive. We might easily need to generate 1K, 10K, 100K (or more) values, of which only a tiny fraction may satisfy our invariant. For our List example above, assuming a maximum array length of 3 and integer range -3..3, there are 2800 possible values of the underlying type of which only 1534 meet the invariant (55%). That’s not too bad, but as we get more invariants this changes quite quickly. Consider this example taken from an implementation of Tarjan’s algorithm for computing strongly connected compoenents:

    type Data is {
      bool[] visited,
      int[] lowlink,
      int[] index
    }
    where |visited| == |lowlink|
    where |lowlink| == |index|
    

    In this case, there are 2.4M values of the underlying type, of which only 946K match the invariant (39%). And, as we add more invariants (e.g. that no element in lowlink can be negative) this ratio continues a downward trend.

  2. (Sampling). Another interesting issue arises with sampling. Typically, we want to sample from large domains (i.e. rather than enumerate them) to allow checking in reasonable time. The problem is that, for complex data types with invariants, the probability that a given value sampled from the underlying domain meets the invariant is often very low. We have observed situations where sampling 1K or 10K values from such a domain produced exactly zero values meeting the invariant. In other words, it was completely useless in these cases (though not in others).

Whilst these may seem like serious problems, they stem from the simplistic fashion in which we generate values for constrained types. Actually, it is possible to generate values of constrained types directly (i.e. without enumerating those of the underlying type) and efficiently. The next evolution of the check functionaliy will implement this and, hopefully, it will offer some big improvements.

Pros / Cons

There are several advantages to checking over static verification, some of which may be unexpected:

Of course, there are some disadvantages with checking as well. Most importantly, checking doesn’t guarantee to find all problems with your program! But, in our experience, it usually finds most of them. There are also problems (discussed above) with generating enough values for complex data types to properly test our functions.

Conclusion

Overall, despite some limitations, we find checking to be incredibly useful — especially as its free. Here’s a short demo to give you a taste:

So, head on over to the online editor for Whiley and give it a go!


Follow the discussion on Twitter or Reddit