Parallel Sum in Whiley

Wednesday, August 3rd, 2011

Recently, I’ve been working on a variety of sequential and concurrent micro-benchmarks for testing Whiley’s performance. An interesting and relatively simple example is the parallel sum. The idea is to sum a large list of integers whilst performing as much work as possible in parallel.

To implement the parallel sum, I divide the list into roughly equal-sized chunks and assign one process to each:

define Sum as process {
    [int] items,
    int start,
    int end,
    int result
}

void Sum::start():
    sum = 0
    for i in start..end:
        sum = sum + items[i]
    this.result = sum

int Sum::get():
    return result

// Sum constructor
Sum ::Sum([int] is, int s, int e):
    return spawn {
        items: is,
        start: s,
        end: e,
        result: 0
    }

Essentially, each Sum process holds the original list of items and a range start..end over which to operate. The result field stores the final sum until the outer loop requests it. The idea is that we first construct the processes, then start them all asynchronously and, finally, collect up the results.
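
For readers without a Whiley toolchain to hand, the construct/start/collect pattern can be sketched roughly in Python. This is only an analogue, not the Whiley implementation: the names SumWorker and one_round, and the use of threading.Thread to stand in for a process, are assumptions made for illustration. CPython threads will not actually speed up a CPU-bound sum, so the sketch shows the structure rather than the performance.

```python
import threading


class SumWorker:
    """Hypothetical analogue of the Sum process: sums items[start:end] on its own thread."""

    def __init__(self, items, start, end):
        self.items = items
        self.start_index = start
        self.end_index = end
        self.result = 0
        self._thread = threading.Thread(target=self._run)

    def _run(self):
        total = 0
        for i in range(self.start_index, self.end_index):
            total += self.items[i]
        self.result = total

    def start(self):
        # Returns immediately, like the asynchronous worker!start() send
        self._thread.start()

    def get(self):
        # Wait for the worker to finish, then hand back its partial sum
        self._thread.join()
        return self.result


def one_round(items, ranges):
    """Construct one worker per (start, end) range, start them all, then collect."""
    workers = [SumWorker(items, s, e) for (s, e) in ranges]
    for w in workers:
        w.start()
    return [w.get() for w in workers]
```

Calling one_round(list(range(10)), [(0, 5), (5, 10)]) gives one partial sum per range; summing those partial sums again is exactly what the next iteration of the outer loop does.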

The outer loop looks something like this:

define N as 100 // block size to use

int ::parSum([int] items):
    while |items| != 1:
        // Calculate how many workers required
        nworkers = max(1,|items| / N)
        size = |items| / nworkers
        // Construct and start workers
        pos = 0
        workers = []
        for i in 0..nworkers:
            if i < (nworkers-1):
                worker = Sum(items,pos,pos+size)
            else:
                // Last worker picks up the slack
                worker = Sum(items,pos,|items|)
            // Start worker asynchronously
            worker!start()
            // Bookkeeping
            workers = workers + [worker]
            pos = pos + size
        // Collect up results
        items = []
        for i in 0..nworkers:
            items = items + [workers[i].get()]
    return items[0]

The key here is that the outer loop continues until the original list of items is reduced to a single result. There may be several iterations required, depending on the block size. Furthermore, the block size determines how many items each process handles in one go. Smaller block sizes lead to more parallelism, but incur higher overheads, since more processes must be created and more iterations of the outer loop are needed. The optimal block size probably depends on the underlying architecture, and would ideally be chosen at runtime.
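
To get a feel for how the block size affects the number of iterations, the worker-count arithmetic from the outer loop can be replayed on its own. The following Python sketch is illustrative only; the function name reduction_rounds is hypothetical, and it assumes integer division, as in the Whiley code.

```python
def reduction_rounds(n, block_size):
    """Return the number of workers used in each iteration of the outer loop
    when starting from a list of n items."""
    rounds = []
    while n != 1:
        nworkers = max(1, n // block_size)  # same formula as in parSum
        rounds.append(nworkers)
        n = nworkers  # each worker contributes one item to the next list
    return rounds


print(reduction_rounds(1_000_000, 100))  # [10000, 100, 1]
print(reduction_rounds(1_000_000, 10))   # [100000, 10000, 1000, 100, 10, 1]
```

Under these assumptions, a million items with a block size of 100 take three iterations, while a block size of 10 takes six and creates roughly eleven times as many workers in total.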