Toward Performance Portability of Highly Parametrizable TRSM Algorithm Using SYCL

17 June 2021



Presented at IWOCL and SYCLcon 2021

Introduced in 1979, BLAS remains the de facto standard for low-level linear algebra routines. It provides essential routines used in domains such as numerical and scientific computing, weather simulation, computational fluid dynamics, and machine learning, and it has been adopted on a broad range of hardware, from HPC systems to embedded systems and specialized AI accelerators.

BLAS routines were originally implemented for CPUs; with the emergence of GPGPU, they had to be rewritten to exploit the extensive computational power on offer. Machine learning is rapidly changing this landscape again by incentivizing the development of specialized hardware that can perform certain operations more efficiently. Given the wide range of hardware, with differing memory hierarchies, cache line sizes, memory access patterns, register counts, and types of memory connection, achieving performance portability of BLAS routines across platforms while avoiding rewrites of existing code is a major challenge in heterogeneous programming.

Written in the SYCL programming language, SYCL-BLAS is an open-source BLAS library that provides performance portability across various SYCL-enabled platforms.

This paper presents the implementation of a parametric, tile-based TRSM routine for SYCL-BLAS that builds on the highly optimized GEMM routine already provided in SYCL-BLAS.
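To make the idea of a tile-parametrized TRSM concrete, the following is a minimal host-side sketch in plain C++, not the SYCL-BLAS kernel and not necessarily the scheme used in the paper. It solves L * X = B for a lower-triangular L by alternating a small triangular solve on each diagonal tile with a GEMM-shaped update of the rows below it. The names `Matrix`, `tiled_trsm_lower`, and `trsm_residual` are hypothetical and introduced only for illustration; the tile size is a template parameter, so it can be changed without touching the algorithm.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical row-major dense matrix, for illustration only.
struct Matrix {
  std::size_t rows, cols;
  std::vector<double> data;
  Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
  double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Solves L * X = B in place (B is overwritten with X).
// Assumes L is n x n lower triangular with a nonzero diagonal.
template <std::size_t Tile>
void tiled_trsm_lower(const Matrix& L, Matrix& B) {
  static_assert(Tile > 0, "tile size must be positive");
  const std::size_t n = L.rows, m = B.cols;
  for (std::size_t k0 = 0; k0 < n; k0 += Tile) {
    const std::size_t k1 = std::min(k0 + Tile, n);
    // Step 1: forward substitution restricted to the diagonal tile.
    for (std::size_t i = k0; i < k1; ++i) {
      for (std::size_t j = 0; j < m; ++j) {
        double acc = B(i, j);
        for (std::size_t p = k0; p < i; ++p) acc -= L(i, p) * B(p, j);
        B(i, j) = acc / L(i, i);
      }
    }
    // Step 2: GEMM-shaped update of the remaining rows:
    // B[k1:n, :] -= L[k1:n, k0:k1] * X[k0:k1, :]
    for (std::size_t i = k1; i < n; ++i) {
      for (std::size_t p = k0; p < k1; ++p) {
        const double l = L(i, p);
        for (std::size_t j = 0; j < m; ++j) B(i, j) -= l * B(p, j);
      }
    }
  }
}

// Builds a small test system, solves it, and returns max |L*X - B|.
template <std::size_t Tile>
double trsm_residual(std::size_t n, std::size_t m) {
  Matrix L(n, n), B(n, m);
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j <= i; ++j)
      L(i, j) = (i == j) ? 2.0 + static_cast<double>(i)
                         : 1.0 / (1.0 + static_cast<double>(i + j));
    for (std::size_t j = 0; j < m; ++j)
      B(i, j) = 1.0 + static_cast<double>(i) + 2.0 * static_cast<double>(j);
  }
  Matrix X = B;
  tiled_trsm_lower<Tile>(L, X);
  double worst = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j < m; ++j) {
      double s = 0.0;
      for (std::size_t p = 0; p <= i; ++p) s += L(i, p) * X(p, j);
      worst = std::max(worst, std::fabs(s - B(i, j)));
    }
  }
  return worst;
}
```

Calling `tiled_trsm_lower<4>(L, B)` and `tiled_trsm_lower<16>(L, B)` runs the same code with different tile sizes. In a device implementation, Step 2 is the part that would be delegated to an optimized GEMM, which is why the choice of tile size shifts work between the triangular solve and the matrix multiply.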

Our results show that, by tuning the tile size per device and without reimplementing the kernel, we can achieve speedups of up to 2.6x on Intel GPU, 7x on AMD GPU, and up to 3.4x on ARM GPU compared with the highly optimized CLBlast and clBLAS libraries.


Rod Burns

VP Ecosystem