Moving code from a CPU to a special-purpose accelerator (such as an SPU) can give huge performance benefits, but it can also involve a lot of work. One of the biggest tasks is adapting code so that it can handle data that lives in a combination of local memory and shared memory. In practice, this means duplicating functions in some way, either by passing the same source through different compilers several times or by writing special macros. Both approaches are error-prone and make code difficult to maintain. Codeplay Offload instead uses an automatic system called "Call-Graph Duplication" to do this work for the programmer.

The key benefits of Call-Graph-Duplication are:

  1. Reduces the amount of work required to move code onto an accelerator processor.
  2. Allows programmers to keep their source code portable and easy-to-maintain.
  3. Removes the requirement to hand-duplicate functions for different processor cores and memory spaces.

Consider a simple example:

int f(int* a, int* b) {
    return *a + *b;
}

In this example, the function 'f' can be called from both PPU and SPU code. If 'f' is called from an SPU offload block, it must be compiled, linked and available in SPU code memory when that offload block runs. On SPU there is a further duplication problem: the pointers 'a' and 'b' could each point to either shared memory or SPU-local memory, so there are 4 different combinations of memory spaces that 'f' could be called with. For example, we could call 'f' from SPU on a shared object and a local object, like this:

void ppu_function () {
    int x = 1; // x is in shared memory
    __offload {
        int y = 2; // y is in SPU-local memory
        y = f (&x, &y);
          // 'f' is being called with a shared-memory-pointer and a local-memory-pointer
    }
}
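
To see where the 4 combinations come from: each pointer parameter can independently refer to local or shared memory, so a function with n pointer parameters has 2^n possible variants on SPU. The sketch below simply lists them. The helper name memory_space_combinations is hypothetical and not part of Offload, and it only counts top-level pointer parameters.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative helper only (not part of Offload): lists every local/shared
// assignment for a function that takes 'pointer_params' pointer parameters.
std::vector<std::string> memory_space_combinations(unsigned pointer_params) {
    assert(pointer_params < 16);  // keep the example small
    std::vector<std::string> result;
    const unsigned total = 1u << pointer_params;  // two choices per parameter
    for (unsigned mask = 0; mask < total; ++mask) {
        std::string combo;
        for (unsigned i = 0; i < pointer_params; ++i) {
            if (i > 0) combo += ", ";
            // Bit i of 'mask' selects the memory space of parameter i.
            combo += (mask & (1u << i)) ? "shared" : "local";
        }
        result.push_back(combo);
    }
    return result;
}
```

For 'f', which has two pointer parameters, this yields four entries, one per variant that would otherwise have to be maintained by hand.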

Without call-graph-duplication, we would have to create at least 2 different versions of the function 'f' by hand for this call alone: the ordinary version used on PPU, and a version for SPU that takes a shared-memory pointer and a local-memory pointer. With call-graph-duplication, we only need one version of function 'f' and the compiler does the rest. This leaves the code more maintainable, easier to read, and much easier to write. It is especially useful when offloading a lot of code onto SPU.

Call-graph duplication is also useful if we decide that (maybe for performance reasons) we want to move the variable 'x' from shared memory to local memory. Without call-graph-duplication, this would involve changing the implementation of 'f', but with call-graph-duplication, the compiler works out that 'f' needs to be compiled slightly differently.
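
The effect can be imitated in standard C++ with templates, purely as an analogy. In the sketch below, the wrapper types local_int_ptr and outer_int_ptr and the load helper are hypothetical names used to make the memory space visible to an ordinary compiler; Offload itself works on plain pointers and needs none of this. The point is only that one source definition of 'f' serves both call sites, and moving 'x' changes the caller, not 'f'.

```cpp
#include <cassert>

// Hypothetical wrappers that encode the memory space in the pointer type.
struct local_int_ptr { int* p; };  // points into SPU-local memory
struct outer_int_ptr { int* p; };  // points into shared memory

// Reading through each kind of pointer. Both are ordinary loads in this
// simulation; on real hardware the second would be a transfer from shared memory.
inline int load(local_int_ptr a) { assert(a.p != nullptr); return *a.p; }
inline int load(outer_int_ptr a) { assert(a.p != nullptr); return *a.p; }

// One source definition of 'f'. The compiler generates a separate copy for
// each combination of pointer types that 'f' is actually called with.
template <typename A, typename B>
int f(A a, B b) { return load(a) + load(b); }

// 'x' in shared memory, 'y' in local memory.
int sum_x_shared() {
    int x = 1, y = 2;
    return f(outer_int_ptr{&x}, local_int_ptr{&y});
}

// 'x' moved to local memory: only the call site changes, 'f' is untouched.
int sum_x_local() {
    int x = 1, y = 2;
    return f(local_int_ptr{&x}, local_int_ptr{&y});
}
```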

This is what the function 'f' might look like without call-graph-duplication:

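// PPU version
int f(int* a, int* b) {
    return *a + *b;
}

// SPU version: 'a' and 'b' both in local memory
int spu_f(int* a, int* b) {
    return *a + *b;
}

// SPU version: 'a' in local memory, 'b' in shared memory
int spu_f_local_global(int* a, __ea int* b) {
    return *a + fetch_int (b);
}

// SPU version: 'a' in shared memory, 'b' in local memory
int spu_f_global_local(__ea int* a, int* b) {
    return fetch_int (a) + *b;
}

// SPU version: 'a' and 'b' both in shared memory
int spu_f_global_global(__ea int* a, __ea int* b) {
    return fetch_int (a) + fetch_int (b);
}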
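
The listing above does not define 'fetch_int'. Assuming it copies one int out of shared memory into a local temporary and returns it, a portable stand-in might look like the sketch below. The body is an assumption made for illustration: std::memcpy models the transfer, whereas real SPU code would use a DMA transfer, and the __ea qualifier is omitted so that the code compiles with an ordinary compiler.

```cpp
#include <cassert>
#include <cstring>

// Stand-in for 'fetch_int' (assumed behaviour): copy one int from shared
// memory into a local temporary and return the copy.
int fetch_int(const int* shared_src) {
    assert(shared_src != nullptr);
    int local_copy;
    std::memcpy(&local_copy, shared_src, sizeof local_copy);
    return local_copy;
}

// Hand-written variant for "a is local, b is shared", mirroring
// spu_f_local_global above.
int spu_f_local_global(const int* a, const int* b_shared) {
    return *a + fetch_int(b_shared);
}
```

Every hand-written variant has to remember which arguments need the fetch and which do not, which is the bookkeeping that call-graph duplication takes over.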

If automatic duplication doesn't give you the performance you need, you can manually override the default behaviour. For example, if you want 'f' to use a different mechanism to read through the pointer 'a' when 'a' points to shared memory, you could write the following:

__offload int f(int __outer * a, int* b) {
    return fetch_manual (a) + *b;
}

The Offload tool will choose to use this implementation of 'f' in the case where 'a' is an 'outer' (or shared memory) pointer, and use the normal implementation of 'f' when 'a' is a local pointer, or when called from PPU. This allows performance optimizers to write special high-performance versions of critical functions without having to modify the main source code, or lose source portability.
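
The selection rule resembles ordinary C++ overload resolution, where an exact non-template overload is preferred over a generic template. The sketch below imitates it in standard C++ as an analogy only. The wrapper types, the load helper, the body of fetch_manual and the counter manual_fetch_count are all hypothetical and exist just to make the chosen version observable; none of them are part of Offload.

```cpp
#include <cassert>

// Hypothetical wrappers that encode the memory space in the pointer type.
struct local_int_ptr { int* p; };  // points into SPU-local memory
struct outer_int_ptr { int* p; };  // points into shared memory

int manual_fetch_count = 0;  // lets a caller observe which version of 'f' ran

inline int load(local_int_ptr a) { assert(a.p != nullptr); return *a.p; }
inline int load(outer_int_ptr a) { assert(a.p != nullptr); return *a.p; }

// Stand-in for a hand-tuned read from shared memory.
inline int fetch_manual(outer_int_ptr a) {
    assert(a.p != nullptr);
    ++manual_fetch_count;
    return *a.p;
}

// Default: one generic definition, duplicated per pointer-type combination.
template <typename A, typename B>
int f(A a, B b) { return load(a) + load(b); }

// Override: used only when 'a' is an outer pointer and 'b' is a local pointer.
inline int f(outer_int_ptr a, local_int_ptr b) {
    return fetch_manual(a) + load(b);
}
```

Calling f with an outer pointer and a local pointer picks the override and increments the counter; every other combination falls back to the generic definition and leaves the counter unchanged.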