Intel® Threading Building Blocks (TBB) is a multi-platform task based library for programming multi-core processors in C++. In this case study, we demonstrate how a program written for the TBB API can be run on the Cell BE platform under Linux using Codeplay Offload C++ to target both the PPU and SPU processors.
The Seismic example program is from the TBB source code distribution. It is a parallel seismic simulation that computes wave propagation using the parallel_for construct from the TBB API, with a serial implementation also provided to allow performance comparisons to be made.
The existing Seismic example can work with the Mac, Windows, and Linux/X11 GUI systems; we have implemented a Linux frame buffer back end for the video system used by the demonstration to run on Cell Linux on a PS3 console. In porting the example to Cell Linux and the frame buffer, we have removed windows specific code from the original example i.e. code guarded with preprocessor macros, and made other minor changes. We refer in this case study to the changes required to allow the code to run on the Cell PPU and SPU processors, and do not comment further on general porting issues.

Source code for the individual program versions discussed here can be downloaded via the links in the resources section.
The main loop of the simulation invokes the SerialUpdateUniverse function.
The serial simulation update methods are nested loops across two dimensional data arrays; the SerialUpdateStress method additionally sets pixel values in the output display.
Such serial code when run on the Cell processor executes on the PPU core, and does not take advantage of the Cell's SPU accelerator cores (6 of which are usable on a Cell Linux system running on a PS3 games console). Our next step is to offload the computationally expensive parts of the program onto one SPU processor, before optimizing its performance, and then turning our attention to the parallel version to exploit all 6 available SPU processors.
The Cell processors DMA system operates on aligned data; a first alteration to the program is to add GNU syntax alignment attributes to the declarations of the global arrays holding the model data e.g.
The offload block causes the compiler to generate program code for the SPU for the code within the offload block, and to generate code to spawn a task on the SPU on entry to the offload block. Likewise, the SPU task is terminated on leaving the scope.
While the code is running on the SPU processor, the data being processed is still held in the global memory of the Cell, and not in the SPU's fast local store. This is not desirable, as the SPU will be idle waiting for the data to be fetched from or to be written to global memory. While there is a software cache system to avoid stalling caused by performing DMA transfers on each data access, it is limited in size, imposes an overhead on writes, and must still perform DMA transfers to cope with cache misses and cache evictions. The naive offload above runs at about 1/50th of the performance of the original serial code; data accesses are a severe bottleneck.
To avoid this bottleneck, we can use the Cell SPU DMA intrinsics to perform bulk data transfers between SPU and PPU. While these can be programmed directly, they are low level, non-type safe, and provide no abstractions for the programmer. The Offload C++ system provides a set of type safe template classes (more specifically specializations of these ) to wrap the DMA intrinsics on SPU, and to provide convenient support for data access use cases. In this case, we use ReadArray and ReadWriteArray template classes; the former fetch data in blocks from an array in the PPU memory to the SPU to be read; the latter additionally ensures that changes made to data held on the SPU are written back to the array in global memory. In the following example, the writes into lV are written back to global memory using this mechanism. The ReadArray and ReadWriteArray templates are used here to fetch entire rows of data from the 2d arrays in global memory onto the SPU local store.
The code using the global arrays is simple to alter to refer to the local arrays instead. Making these changes has a substantial effect on performance - the offload onto a single SPU is now running at about 60% of the speed of the original serial code. We have taken existing serial code, and offloaded it to execute on a single SPU processor of the Cell. We could optimize single SPU execution further, using SIMD instructions or double buffering data transfers, however now we turn our attention to offloading existing parallel code using the TBB API to execute both on PPU and SPU processors on the Cell.
The TBB approach to dividing up the iteration space of a loop across multiple tasks and processors is to specify the body of the loop as a C++ function object that defines an operator() method to specify the operations to apply to each element of the iteration space in a given range.
These function objects are then passed as arguments to the tbb::parallel_for function, which takes as arguments an iteration space and the function object to apply to points in the iteration space. The blocked range models a 1D iteration that can be recursively subdivided, down to some lower bound, typically chosen so that sequential execution of the remaining iterations is more efficient than further parallel work sharing.
The TBB library is not available for Cell Linux and SPUs, so we must implement support for running TBB code ourselves. To do so, we implement a small (~100 lines) header file to implement parallel_for Offload C++ and use that instead of the TBB headers and library.
First, from the code we can see we need implementations of tbb::parallel_for and tbb::blocked_range. We can take these from the appropriate TBB header files (taking care to enclose the templates in appropriate namespaces). The tbb::blocked_range template is used as is. The parallel_for function template we use is shown below. There are additional overloads allowing the user to specify how work is to be partitioned across multiple processors, but we implement the default version of parallel_for here. We can see that in order to implement parallel_for, we must also provide an implementation of internal::start_for.
The start_for implementation in the TBB header files implements a recursive divide and conquer approach to task creation, where the iterations are divided and subdivided into tasks to execute in parallel. We need to provide an implementation of the static run method where work is divided between the SPU processors and the PPU, and this is what is done below.
There are 6 SPU processors accessible on Cell Linux, plus the PPU processor, so the total number of threads we will divide work between is 7, six of those being offloaded onto the SPU processors. The body of run spawns offload threads parameterised with a range and the function object to apply; it then also loops over a range of work on the PPU, applying the function object, before finally awaiting the completion of each offload thread in the offloadThreadJoin operation.
With the TBB approach of passing function objects into template classes and template functions, the functions to invoke are all statically known. Therefore, the Offload C++ compiler is able to automatically compile the function object operator() routine for the SPU and for the PPU, generating the data transfer code needed to move data between global and SPU memory. However, as in the serial version, the data transfer intensive nature of the code results in non-optimal performance on the SPU. Therefore, we can apply the same techniques of wrapping DMA intrinsics in typesafe data transfer classes to the code in the function object. Use of the DMA templates here improves the parallel performance of the HD version from 20% of the serial code to 5.9 times faster than the serial code.
struct UpdateVelocityBody {
Since the code is being executed on both the PPU and the SPU, it must be possible to use the ReadArray and ReadWriteArray template classes on both. The actual data transfer implementations used by these classes are statically selected via template specialization to use the builtin memcpy operations on PPU (e.g. outside of an offload block) and the SPU DMA intrinsics inside the offload block.
We have shown how the TBB parallel_for construct can be readily used to distribute work accross the Cell Processors SPU and PPU cores, using a simple static work division. It would be possible to implement a more sophisticated work division scheme, where worker threads are able to dynamically task-steal. Our proof of concept header could be greatly extended to cover more of the TBB API.
This case study has demonstrated how existing serial and parallel C++ code can be run on the Cell processor offloading execution onto the SPU processors. A small compatibility header is used to provide an implementation of the TBB parallel_for construct that splits work between the PPU and SPU processors, with the Offload C++ compiler using automatic call-graph duplication to compile code to execute both on the PPU and SPU processors, obtaining a 5.9 times performance increase with relatively little effort.
When individual data transfers between PPU and SPU memory prove inefficient, bulk transfers are easily accomplished using template classes to wrap DMA operations on SPU. Statically selecting data transfer code via template specialisation at compile time to use memcopy on PPU allows the same optimised code to execute unmodified on PPU and SPU.