# Automated Testing for Whiley

Recently, the online editor for Whiley was updated with some new features. Actually, the update represents a *complete rewrite of the front-end in Whiley*. Obviously, I am very excited about that! Previously it was written using raw (i.e. ugly) JavaScript, but now uses a framework for Functional Reactive Programming (called Web.wy). That has made a huge difference to how the code looks. Still, I’m not going to talk about that here. Rather, it is the new *check* feature (highlighted in red below) that I’m interested in:

*What is this new “check” feature then?* In a nutshell, it automatically generates test inputs for functions and runs them through the code looking for problems (e.g. *divide-by-zero*, *index-out-of-bounds*, etc). In many ways, it is similar to the QuickCheck line of tools (with the added benefit that Whiley has first-class specifications). We can think of it as a “half-way” step towards formal verification. The key is that it is easier to check a program than it is to statically verify it (more on this below). Some might think that having a check feature like this doesn’t make sense when you also have static verification. But, I prefer to think of them as *complementary*. Thinking about a developer’s workflow, we might imagine checking a function first as we develop it (since this is easier and quicker) before, finally, attempting to statically verify it (since this is harder and may force the specification to be further refined).

## Example

As a simple example to illustrate how it works, consider the following
signature for the `max(int[])`

function:

```
function max(int[] items) -> (int r)
// items cannot be empty
requires |items| > 0
// result not smaller than any in items
ensures all { i in 0..|items| | r >= items[i] }
// result is element in items
ensures some { i in 0..|items| | r == items[i] }:
...
```

We’re not really concerned with the implementation here (and we can assume it might have bugs). To check this function, the tool will do the following:

**(Generate)**The tool generates some number of valid inputs. That is, input values which meet the precondition. For our example above, that is any`int[]`

array containing at least one element.**(Execute)**. The function in question is then executed using each of the generated inputs. This may lead to obvious failures (e.g. out-of-bounds or divide-by-zero errors), in which case we’ve already found some bugs!**(Check)**. For any execution which completed, the tool then checks the result against the postcondition. If the postcondition doesn’t hold then, again, this indicates a bug somewhere.

The key is that the specification acts as the test
oracle. In other words,
*having written the specification we get testing for free*! And, the
tool takes care of the difficult stuff so we don’t have to. For
example, generating valid inputs efficiently is harder than it looks
(more on this below).

## Technical Stuff

The main technical challenge is that of efficiently generating *input
values under constraints*. To start with, we can generate raw input
values for the various data types in Whiley:

**Primitives.**These are pretty easy. For example, to generate values of type`int`

, we’ll use some (configurable) domain (e.g.`-2 .. 2`

) of values. We can easily sample uniformly from this domain as well (e.g. using Knuth’s Algorithm S).**Arrays**. Here, we limit the maximum length of an array (e.g. length`2`

), and then enumerate all arrays upto this length. To generate values in the array, we recursively call the generator associated with the element type. For the array type`int[]`

and assuming max length`2`

and integers`-1..1`

, we might generate:`[]`

,`[-1]`

,`[0]`

,`[1]`

,`[-1,-1]`

,`[0,-1]`

,`[1,-2]`

, etc.**Records**. These are similar, except there is no maximum length. For each field, we recursively generate values using the generator associated with its type. For the record type`{bool flag, int data}`

and assuming and integers`-1..1`

, we might generate:`{flag:false,data:-1}`

,`{flag:true,data:-1}`

,`{flag:false,data:0}`

,`{flag:true,data:0}`

,`{flag:false,data:1}`

, etc.**References**. These are more tricky as we must allow for aliasing between heap locations. For a single reference type`&int`

, we generate heap locations for each of the possible integer values. However, for two reference types (e.g. parameters`&int x, &int y`

), we have to additionally consider both the*aliased*and*non-aliased*cases.**Lambdas**. These are very challenging as, in principle, it requires enumerating all implementations of a function! Consider the lambda type`function(int)->(int)`

—*how do we generate values (i.e. implementations) of this type?*There is no easy answer here, but there are some strategies we can use. Firstly, we can scour the source program looking for functions with the same signature. Secondly, we can generate simple input/output mappings (e.g.`{-1=>-1, 0=>0, 1=>1}`

). We can go beyond this by applying*rotations*(e.g.`{-1=>0, 0=>1, 1=>-1}`

).

Given the ability to generate arbitrary values for a type, the more difficult question is how to generate values which meet some constraint. For example, consider this data type:

```
type List<T> is {
int length,
T[] items
}
where 0 <= length
where length <= |items|
```

The current approach we take is rather simplistic here. We simply
generate all values of the *underlying type* and discard those which
don’t satisfy the invariant. This works, but it has some problems:

*(Scale)*. Enumerating all values of a complex data type upto a given bound can be expensive. We might easily need to generate*1K*,*10K*,*100K*(or more) values, of which only a tiny fraction may satisfy our invariant. For our`List`

example above, assuming a maximum array length of`3`

and integer range`-3..3`

, there are*2800*possible values of the underlying type of which only*1534*meet the invariant (55%). That’s not too bad, but as we get more invariants this changes quite quickly. Consider this example taken from an implementation of Tarjan’s algorithm for computing*strongly connected compoenents*:`type Data is { bool[] visited, int[] lowlink, int[] index } where |visited| == |lowlink| where |lowlink| == |index|`

In this case, there are

*2.4M*values of the underlying type, of which only*946K*match the invariant (39%). And, as we add more invariants (e.g. that no element in`lowlink`

can be negative) this ratio continues a downward trend.*(Sampling)*. Another interesting issue arises with sampling. Typically, we want to sample from large domains (i.e. rather than enumerate them) to allow checking in reasonable time. The problem is that, for complex data types with invariants, the probability that a given value sampled from the underlying domain meets the invariant is often very low. We have observed situations where sampling*1K*or*10K*values from such a domain produced*exactly zero*values meeting the invariant. In other words, it was completely useless in these cases (though not in others).

Whilst these may seem like serious problems, they stem from the simplistic fashion in which we generate values for constrained types. Actually, it is possible to generate values of constrained types directly (i.e. without enumerating those of the underlying type) and efficiently. The next evolution of the check functionaliy will implement this and, hopefully, it will offer some big improvements.

## Pros / Cons

There are several advantages to checking over static verification, some of which may be unexpected:

*(Incomplete Specifications)*. We can check our programs even when the specifications are incomplete. Suppose we are developing two functions in tandem, say`max(int[])`

which uses`max(int,int)`

as a subroutine. With static verification, we cannot verify`max(int[])`

before we have specified and verified`max(int,int)`

. However, we can start checking`max(int[])`

as soon as we have implementated`max(int,int)`

(i.e. since checking just executes it).*(Loop Invariants)*. We can check our programs without writing any loop invariants. Since writing loop invariants can be challenging, this is an important benefit. With static verification, we cannot verify a function containing a loop without correctly specifying the loop invariant first. That can be a real pain, especially in the early stages of developing a program.*(Counterexamples)*. Whenever checking finds a problem in our program, it will also tell you the input values causing it and, furthermore, they always correspond to real failures in your program. Static verification, on the other hand, cannot always generate an execution trace and, even when it does, they are not always*realisable*traces of the program. That can make life quite frustrating!

Of course, there are some disadvantages with checking as well. Most importantly, checking doesn’t guarantee to find all problems with your program! But, in our experience, it usually finds most of them. There are also problems (discussed above) with generating enough values for complex data types to properly test our functions.

### Conclusion

Overall, despite some limitations, we find checking to be incredibly
useful — *especially as its free*. Here’s a short demo to give you
a taste:

So, head on over to the online editor for Whiley and give it a go!

*Follow the discussion on Twitter or Reddit*