Using codeplay::ReadArray to improve performance of array processing on SPE.

Table of Contents

Using codeplay::ReadArray to improve performance of array processing on SPE.

Setting up the project.

Writing some code.

Running executables built with Offload in the Offload for Eclipse IDE.

Offloading PPE code to SPE with __offload { }

Optimizing the array calculation with ReadArray and WriteArray.

Tips for good performance

Alignment

Offload parameters and keeping things on SPE


Results

Using codeplay::ReadArray to improve performance of array processing on SPE.


When an array that needs to be processed on the SPE is actually held on the PPE, the DMA traffic between the SPE and PPE can become a bottleneck. This is caused by many reads and writes to the array(s) on the PPE. Performance takes a massive hit when this happens, as the SPE spends most of its time waiting for data to be written to or fetched from the PPE. The “Offload” system does have a software cache which reduces the time of the sometimes lengthy DMA reads and writes, but it is limited in size, so with larger arrays the bottleneck still occurs.
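To get a feel for the scale of the problem, the following is a rough counting sketch in portable C++. It is not part of the Offload API: the function names are hypothetical, and the first model is deliberately pessimistic because it ignores the software cache entirely and treats every array access as its own transfer. The second model assumes each array crosses between PPE and SPE exactly once as a block.

```cpp
#include <cstddef>

// Hypothetical cost model (not an Offload function): every read and every
// write inside the innermost loop body is counted as a separate transfer.
std::size_t transfers_per_element(std::size_t iterations, std::size_t width,
                                  std::size_t reads_per_element) {
    // The loop nest runs iterations * width * width times; each pass does
    // reads_per_element reads plus one write.
    return iterations * width * width * (reads_per_element + 1);
}

// Hypothetical cost model (not an Offload function): each input array is
// fetched once as a block and each output array is written back once.
std::size_t transfers_blocked(std::size_t input_arrays,
                              std::size_t output_arrays) {
    return input_arrays + output_arrays;
}
```

For the example used later in this article (100 iterations, a width of 128, seven array reads and one write per innermost iteration) the first model gives 13,107,200 transfers, while the second gives 7 for six input arrays and one output array. Real behaviour lies between the two because of the software cache, but the gap shows why block transfers are worth pursuing.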

This article will show you how to create a new Offload project in Eclipse for Offload with an example that uses large arrays with many iterations. It will then show you how to offload the array processing and optimize using the codeplay ReadArray template.

Setting up the project.

You will need to download and install “Offload for Cell Linux” from http://offload.codeplay.com. This package contains the OffloadCPP compiler, CellPlayer debugger, PS3 toolchain and Eclipse for Offload IDE: all the tools needed to build and run code entirely remotely on a PS3 with Linux installed. The only prerequisite is that Cygwin is installed. Ensure also that the Cygwin commands “ed” and “make” are installed, as these are needed by the Offload system. If you do not have these, the installer will tell you.

In order to debug remotely on the PS3, a script has to be copied to the PS3 and run from the command line. Please note that this only has to be copied across once. The script gdbservice.py is in

C:\Program Files\Codeplay\Offload for Cell Linux\Player Debugger

Use scp to copy the script to the PS3 remotely by doing something like:

scp gdbservice.py user@ps3linux:.

Where “user” is a valid user on the system and “ps3linux” is the name of the PS3 with Cell Linux installed.

Now ssh into the PS3 as “user” with:

ssh user@ps3linux

Then run the python script to enable remote debugging:

python gdbservice.py &

This will run the script in the background. If for any reason the script needs to be stopped, find the pid of the python process with “ps” and then kill it with kill <pid of the python process>.




Open Eclipse for Offload and make a new project by going to File > New > Project. Then expand the “C/C++” submenu and select the option for “C++ Project”. Then choose “Next”.

In the next screen, enter a suitable project name. Under the “Executable” submenu select “Offload C++ Project” and in toolchains to the right select “Offload C++”. Press “Finish”.

You now may be asked if you wish to switch to the C/C++ perspective. Choose yes here. It may also be worth ticking the box that says “Remember my decision” on the bottom left of this dialog.

A project will now be set up containing a very simple cpp file. This project, though it just returns 0, will compile and link.


Expand the project directory to find the src folder. The cpp file will have a simple main with return 0. “Eclipse for Offload” should have already built the project and generated the “Debug” folder and its contents, the elf file. This is the executable file that will run on the PS3.



Writing some code.


Above main(), paste the following code. It defines ITERATIONS and MAX_WIDTH, declares the arrays and adds the timer class that does the timing.

#include <stdio.h>   /* printf, used by the timer class */
#include <sys/time.h>
#include <stdlib.h>
#include <math.h>
#include <offloadutils>

#define MAX_WIDTH 128

#define ITERATIONS 100


typedef float element_t;


static element_t a[MAX_WIDTH] __attribute__((aligned(16)));

static element_t b[MAX_WIDTH] __attribute__((aligned(16)));

static element_t c[MAX_WIDTH*MAX_WIDTH] __attribute__((aligned(16)));

static element_t d[MAX_WIDTH*MAX_WIDTH] __attribute__((aligned(16)));

static element_t e[MAX_WIDTH*MAX_WIDTH] __attribute__((aligned(16)));


static element_t l[MAX_WIDTH] __attribute__((aligned(16)));

static element_t f[MAX_WIDTH] __attribute__((aligned(16)));

static element_t g[MAX_WIDTH] __attribute__((aligned(16)));

static element_t h[MAX_WIDTH] __attribute__((aligned(16)));



extern "C" int gettimeofday (struct timeval *, void *);


class timer {

private:

    struct timeval t1;

        struct timeval t2;

        const char* msg;

        void time_int(int iters, int print=0)

       {

            gettimeofday(&t2, NULL);

           if( print == 1 ) {

               double d1 = t1.tv_sec+(t1.tv_usec/1000000.0);

              double d2 = t2.tv_sec+(t2.tv_usec/1000000.0);

              printf("%s Time spent [%.3fms] \n", msg, (((double)(d2-d1)*1000.0))/(double)iters);

          }

          t1 = t2;

       }

public:


       timer(const char* m) : msg(m)

      {

      }


      void start() {

          time_int(0);

      }

      void stop(){

           time_int(ITERATIONS,1);

      }

};



Note that the above code also declares the arrays a, b, c, d, e, l, f, g and h globally. These are read from and written to many times in the body of the loops.

The additional aligned(16) attribute is needed when copying memory between the PPU and SPU.
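As a quick sanity check of the alignment requirement, the sketch below shows one way to test for a 16-byte boundary in portable C++11. It assumes alignas(16) as a standard stand-in for the GCC-style attribute used in this article; the names aligned_buf and is_aligned16 are hypothetical and are not part of Offload.

```cpp
#include <cstdint>

typedef float element_t;

// alignas(16) is the standard C++11 spelling; the article's own code uses
// __attribute__((aligned(16))) for the same purpose.
alignas(16) static element_t aligned_buf[128];

// Hypothetical helper: true when p lies on a 16-byte boundary, i.e. the
// low four bits of the address are all zero.
inline bool is_aligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 0xF) == 0;
}
```

With 4-byte floats, &aligned_buf[0] and &aligned_buf[4] pass the check while &aligned_buf[1] does not, which is a reminder that a pointer into the middle of an aligned array is not necessarily aligned itself.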

We are going to fill the input arrays with random numbers. We will then iterate over them doing a nonsense calculation while measuring the time taken to do this.

Select the entire main() function and replace it with the following code.

 



    int main() {

        /* Initialise the input arrays */

        for (int i = 0; i < MAX_WIDTH ; i++){

            a[i] = rand() % 10 -1;

            b[i] = rand() % 11 -1;


            l[i] = rand() % 11 -1;

            f[i] = rand() % 11 -1;

            g[i] = rand() % 11 -1;

            h[i] = rand() % 11 -1;

        }


    /* Initialise the Offload runtime to eliminate startup costs subsequently */

        __offload {}


        timer timePPU("Time on PPU");

        timePPU.start();



        for (int k = 0; k < ITERATIONS; k++)

           for (int i = 0; i <MAX_WIDTH; i++)

              for (int j = 0; j<MAX_WIDTH; j++)

                   c[i*MAX_WIDTH+j] = (b[i] + k * k + a[i]) - l[i]+(l[i]/2)+(h[i]/3) * cos(g[i]+b[50]);

 

        timePPU.stop();


   }


At the moment, when this code is executed it will only run on PPE and should be reasonably fast. We are just doing this to show the differences in performance between PPE only, offloaded and offloaded with ReadArray optimizations.


To achieve the best performance we must build our example in release mode, so click the small downward pointing arrow next to the hammer to select “Release”. The example will now build a file with extension “.elf” containing no debug info. Now we should run this with the debugger.

Running executables built with Offload in the Offload for Eclipse IDE.

Now we are going to launch the “Player Debugger” to simply run the Linux executable we just made in the previous step. We are not going to explain source level debugging in this article; we are just going to use the debugger to run what we have written. Note that the code can also be executed on the PS3 by copying the compiled elf across with scp, as was done for the gdbservice.py script earlier. If this is the preferred option, run it by doing "./test.elf" after logging in with ssh.

Simply click the “Launch Codeplay Player Debugger” button which is on the ToolBar in Eclipse. Look for the coloured Codeplay Offload logo and click it; this will launch the Player Debugger.




The first thing to do in the Cell Player debugger is to establish a connection with the PS3, so we need to add the details of the PS3 to the debugger. In the ToolBar of the Cell Player debugger there is a white drop-down selection called “Cell Linux Remote Hosts”. Select this and choose “Add”.


In the box that appears, type the name of the PS3 system and click “OK”. It will now be added, but we still need to select the system we just added. Choose the menu again and click the name you just entered. The PS3 system will now appear selected and its status will be indicated with the word “Ready” appended to it.



To launch the executable, click “File” then “Open executable” and choose the elf that was just built. By default the debugger will stop at main. Do not be worried when you do not see the source code of the example; this is because we built the example in release mode, so there is no debugging information. For now we simply want to run it.


Note that if we had selected a Debug build of the example in the Eclipse for Offload IDE, the source code would be shown. As the program is stopped at main, we need to resume the process, so press the green “Run” button to continue running it. Any standard output will appear at the bottom right of the debugger under the “Output” tab.



The example for PPE only should now have run on the PS3 and the output should be displayed in the output window of the Cell Player debugger. That example usually takes about 4.3ms to run on PPE only. This document will now show how to offload that example onto SPU, then show how to optimize the offloaded code with the Codeplay ReadArray and WriteArray templates.



Offloading PPE code to SPE with __offload { }

Offloading from PPE to SPE with “offload” requires the programmer to wrap the code they want to offload in an __offload block. In the case of the example we have been working on this is simple.



     timer timeSimple("Time with Simple SPU Offload");

     timeSimple.start();


     __offload {

     for (int k = 0; k < ITERATIONS; k++)

        for (int i = 0; i <MAX_WIDTH; i++)

           for (int j = 0; j<MAX_WIDTH; j++)

                 d[i*MAX_WIDTH+j] = (b[i] + k * k + a[i]) - l[i]+(l[i]/2)+(h[i]/3) * cos(g[i]+b[50]);

     }

   timeSimple.stop();



Wrap only the code you want to run on SPE in the offload block and note that the __offload keyword does not require any special inclusions for it to work. This type of offload is a “blocking offload”, meaning the PPE waits for the SPE to finish before it continues processing. This is a serial operation. For parallel operations an asynchronous offload thread can be used, but this is not covered in this document. Remember to instantiate a new timer and start it before the offload block and stop it just after.

Run the example again. The times for PPE only and for SPE with offload should both be shown. The offloaded version usually takes about 27.5ms, more than 6x slower than PPE only. As explained, the bottleneck is caused by the SPE waiting for the DMA from the PPE, and the effect on performance grows as the number of reads and writes in a calculation increases. To avoid this we use the ReadArray and WriteArray template classes: the former fetches data in blocks from an array in PPE memory to the SPE to be read; the latter ensures that changes made to data held on the SPE are written back to the array in global memory.

Optimizing the array calculation with ReadArray and WriteArray.

The codeplay ReadArray fetches blocks from a given array which is on the PPE and gives them to the SPE. A template of the following form is used to create a temporary on SPE which is then read more efficiently from within the loop:

codeplay::ReadArray<element_t, MAX_WIDTH> Ca(&a[0]);

Here element_t is the element type, MAX_WIDTH is the size of the array and Ca is an object of the ReadArray class that will now be used instead of array a. A pointer to the starting point has to be passed in; in this case we have given the address of the first element. We would use Ca inside the loop instead of a. The ReadArray constructor can also be used under gcc, which offers compatibility between platforms.
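The idea of reading through a local temporary can be modelled in portable C++. The class below is a hypothetical stand-in written for illustration, not the codeplay::ReadArray implementation: it assumes the whole array is copied once at construction (memcpy plays the role that block DMA would play on the SPE) and that all later reads are served from the local copy. The names LocalReadCopy and sum_via_local_copy are invented here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in, NOT the codeplay::ReadArray implementation.
template <typename T, std::size_t N>
class LocalReadCopy {
public:
    // Copy all N elements once; on the SPE this is where a block
    // transfer from PPE memory would happen.
    explicit LocalReadCopy(const T* source) {
        std::memcpy(data_, source, N * sizeof(T));
    }

    // Reads touch only the local buffer, never the original array.
    const T& operator[](std::size_t i) const {
        assert(i < N);
        return data_[i];
    }

private:
    T data_[N];
};

// Sum four floats through the local copy, in the same way Ca is used in
// place of a inside the loop.
inline float sum_via_local_copy(const float* source) {
    LocalReadCopy<float, 4> local(source);
    float total = 0.0f;
    for (std::size_t i = 0; i < 4; ++i) {
        total += local[i];
    }
    return total;
}
```

One consequence of this design is that the local copy is a snapshot: in this sketch, changes made to the source array after construction are not visible through the copy.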

Construct a ReadArray for all the arrays that are read inside the loop and a WriteArray for the array that's written to. These should be constructed inside the offload block but before the loop.


     __offload {

    codeplay::ReadArray<element_t, MAX_WIDTH> Ca(&a[0]);
    codeplay::ReadArray<element_t, MAX_WIDTH> Cb(&b[0]);
    codeplay::ReadArray<element_t, MAX_WIDTH> Cl(&l[0]);
    codeplay::ReadArray<element_t, MAX_WIDTH> Cf(&f[0]);
    codeplay::ReadArray<element_t, MAX_WIDTH> Cg(&g[0]);
    codeplay::ReadArray<element_t, MAX_WIDTH> Ch(&h[0]);
    codeplay::WriteArray<element_t, MAX_WIDTH*MAX_WIDTH> Ce(&e[0]);

        for (int k = 0; k < ITERATIONS; k++)

       {

            for (int i = 0; i <MAX_WIDTH; i++)

           {

               for (int j = 0; j<MAX_WIDTH; j++)

                   Ce[i*MAX_WIDTH + j] = (Cb[i] + k * k + Ca[i]) - Cl[i]+(Cl[i]/2)+(Ch[i]/3) * cos(Cg[i]+Cb[50]);

           }

      }

    }

In the code above, all the input arrays have been replaced with ReadArrays and the output array with a WriteArray. The write from the SPE to the PPE is just as important as the reads for improving performance. The WriteArray DMAs the data written to it from SPE to PPE when it is destroyed. Because it is declared locally in the offload block, this happens when the block is exited, after the outer loop has finished. That is why the WriteArray is constructed inside the offload block, just before the outer loop.
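The write-back-on-destruction behaviour can be modelled in portable C++ with an ordinary destructor. The class below is a hypothetical stand-in for illustration, not the codeplay::WriteArray implementation: it assumes writes are collected in a local buffer and that the whole buffer is copied to the destination once, when the object goes out of scope. The names LocalWriteBack and value_seen_before_flush are invented here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in, NOT the codeplay::WriteArray implementation.
template <typename T, std::size_t N>
class LocalWriteBack {
public:
    // The local buffer starts zero-initialised in this sketch.
    explicit LocalWriteBack(T* destination) : dest_(destination), data_() {}

    // One block copy on destruction; on the SPE this is where the DMA
    // back to PPE memory would happen.
    ~LocalWriteBack() {
        std::memcpy(dest_, data_, N * sizeof(T));
    }

    // Writes go to the local buffer only.
    T& operator[](std::size_t i) {
        assert(i < N);
        return data_[i];
    }

private:
    LocalWriteBack(const LocalWriteBack&);            // non-copyable
    LocalWriteBack& operator=(const LocalWriteBack&); // non-assignable

    T* dest_;
    T data_[N];
};

// Returns what the destination held while the buffer was still alive.
inline float value_seen_before_flush(float* destination) {
    float seen;
    {
        LocalWriteBack<float, 4> out(destination);
        out[0] = 42.0f;
        seen = destination[0];  // destination is still unchanged here
    }                           // destructor copies the buffer back here
    return seen;
}
```

The scope in which the object is declared therefore decides when the data arrives at its destination, which is why the placement of the declaration matters.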

Add in the timing class for this offload block, again starting just before the offload block and ending just after it.

Building and running this should show that the newly optimized and offloaded array code with codeplay ReadArray and WriteArray runs faster than PPE only by about a factor of 2.

 

 Results



The graph below shows the results from the above example. It shows that with simple usage of the codeplay::ReadArray and WriteArray templates, a speedup of about two times (2x) on one SPE can be obtained over the same code run on PPE.

 

Tips for good performance

There are some other ways of squeezing that tiny bit more speed out of offloaded code. This small section will describe some.


Offload parameters and keeping things on SPE

When offloading with __offload, parameters can be passed in parentheses after the __offload keyword to help improve performance. Local variables (including function parameters) declared outside an offload block are referred to as outer stack locals. __offload block arguments allow these variables to be pushed to the SPE, allowing fast access within the __offload block. With a blocking offload block, as in the example below, reading and writing to the argument(s) is allowed. PPE globals do not need to be put into offload block arguments.

     int local;

     int *pointer; //outer stack locals

     __offload (local, pointer) {

         local = local + 1; // Legal, local is modifiable in a synchronous block

    } // local is copied back from SPU to PPU here
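The copy-in and copy-back behaviour of an offload block argument can be pictured with the portable C++11 sketch below. It is a hypothetical model for illustration only, not how the Offload runtime is implemented: it assumes the argument is copied to a fast local at block entry, the body works on that copy, and the value is copied back at block exit. The names run_with_copy_in_out and increment_like_offload_block are invented here.

```cpp
// Hypothetical model of a blocking offload block argument,
// NOT the Offload runtime.
template <typename Body>
void run_with_copy_in_out(int& outer, Body body) {
    int on_spe = outer;  // copy in: the value is pushed to fast local storage
    body(on_spe);        // the block body reads and writes the local copy
    outer = on_spe;      // copy back: the result returns to the outer local
}

// Mirrors the snippet above: local = local + 1 inside the block.
inline int increment_like_offload_block(int local) {
    run_with_copy_in_out(local, [](int& v) { v = v + 1; });
    return local;
}
```

In this model the body never touches the outer variable directly, so every access inside the block is a local one and only two copies cross the boundary.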

 

For something that is only ever used on SPE (within the offload block), rather than declaring it on PPE and then using it on SPE, simply move it to SPE by declaring it inside the offload block. This removes the need for a DMA from PPE to SPE. However, this cannot be done if the outer stack local is also read or written on PPE. The following three versions of the function foo() show a simple offload with an outer stack local being read from SPE, the same outer stack local being passed in as an offload block parameter to speed things up, and finally the fastest option: moving int i so that it is declared in SPE local store.

 

A simple offload with an outer stack local being read from SPE:

    void foo()

       { // Serial SPE offload

             int i;

              __offload {

                       for (i = 0; i < 10; i++)

                       {

                             // stuff

                       }

              }

       }

Outer stack local being passed in as an offload block parameter:

      void foo()

      {

           int i;

           // Make i a parameter for faster access

           __offload (i) {

                   for (i = 0; i < 10; i++)

                   {

                           // stuff

                   }

          }

      }

Moving int i to be declared on SPE local store:

     void foo()

     {

          // Better still, move declaration into offload block

           __offload {

               for (int i = 0; i < 10; i++)

               {

                      // stuff

               }

          }

     }