OpenCLRunTime

Erik Schnetter <eschnetter@perimeterinstitute.ca>

May 17, 2012

Abstract

Executing OpenCL kernels requires some boilerplate code: one needs to choose an OpenCL platform and device, compile the kernel code (from a C string), pass in arguments, and finally execute the actual kernel. This thorn, OpenCLRunTime, provides a simple helper routine for these tasks.

1 Overview

Thorn OpenCLRunTime performs the following tasks: it selects an OpenCL platform and device, compiles OpenCL kernels (remembering them so that they need not be recompiled for every call), allocates device memory for grid functions and copies data between host and device (in interaction with thorn Accelerator), and provides convenience macros for grid structure information and for parallel loops.

At the moment, OpenCLRunTime only supports unigrid simulations; adaptive mesh refinement or multi-block methods are not yet supported. (The main reason for this is that the cctkGH entries on the device are not updated; however, this should be straightforward to implement.)

2 Example

An OpenCL compute kernel may be called as follows:

  char const *const groups[] = {  
    "WaveToyOpenCL::Scalar",  
    NULL};  
 
  int const imin[] = {cctk_nghostzones[0],  
                      cctk_nghostzones[1],  
                      cctk_nghostzones[2]};  
  int const imax[] = {cctk_lsh[0] - cctk_nghostzones[0],  
                      cctk_lsh[1] - cctk_nghostzones[1],  
                      cctk_lsh[2] - cctk_nghostzones[2]};  
 
  static struct OpenCLKernel *kernel = NULL;  
  char const *const sources[] = {"", OpenCL_source_WaveToyOpenCL_evol, NULL};  
  OpenCLRunTime_CallKernel(cctkGH, CCTK_THORNSTRING, "evol",  
                           sources, groups, NULL, NULL, NULL, -1,  
                           imin, imax, &kernel);

The array groups specifies which grid functions are to be available in the OpenCL kernel. It is a C array terminated by NULL. (This information could instead be gathered from the respective schedule.ccl declarations.)

The integer arrays imin and imax specify the iteration bounds of the kernel. This information is necessary so that OpenCL can properly map this iteration space onto the available OpenCL work-groups and work-items (threads).

The array sources (a C array terminated by NULL) specifies the actual source code for the kernel. The first string (here empty) can contain declarations and definitions that should be available outside the kernel function. The second string specifies the actual kernel code, excluding the function declaration, which is inserted automatically. Here is an example of such kernel code:

// Grid points are indexed in the same way as for a CPU  
// Using ptrdiff_t instead of int is more efficient on 64-bit  
// architectures  
ptrdiff_t const di = 1;  
ptrdiff_t const dj =  
  CCTK_GFINDEX3D(cctkGH,0,1,0) - CCTK_GFINDEX3D(cctkGH,0,0,0);  
ptrdiff_t const dk =  
  CCTK_GFINDEX3D(cctkGH,0,0,1) - CCTK_GFINDEX3D(cctkGH,0,0,0);  
 
// Coordinates are calculated in the same way as for a CPU  
CCTK_REAL const idx2 = 1.0 / pown(CCTK_DELTA_SPACE(0), 2);  
CCTK_REAL const idy2 = 1.0 / pown(CCTK_DELTA_SPACE(1), 2);  
CCTK_REAL const idz2 = 1.0 / pown(CCTK_DELTA_SPACE(2), 2);  
CCTK_REAL const dt2  = pown(CCTK_DELTA_TIME, 2);  
 
// Note: The kernel below is not vectorised (since it doesn’t use  
// CCTK_REAL_VEC). Therefore, vectorisation must be switched off in  
// the parameter file (via OpenCLRunTime::vector_size_x = 1).  
 
// This loop macro automatically parallelizes the code  
// imin[] and imax[] are passed from the host  
LC_LOOP3(evol,  
         i,j,k, imin[0],imin[1],imin[2], imax[0],imax[1],imax[2],  
         cctk_lsh[0],cctk_lsh[1],cctk_lsh[2])  
{  
  // Calculate index of current point  
  ptrdiff_t const ijk = di*i + dj*j + dk*k;  
 
  CCTK_REAL const dxxu = idx2 * (u_p[ijk-di] - 2.0 * u_p[ijk] + u_p[ijk+di]);  
  CCTK_REAL const dyyu = idy2 * (u_p[ijk-dj] - 2.0 * u_p[ijk] + u_p[ijk+dj]);  
  CCTK_REAL const dzzu = idz2 * (u_p[ijk-dk] - 2.0 * u_p[ijk] + u_p[ijk+dk]);  
 
  CCTK_REAL const uval =  
    +2.0 * u_p[ijk] - u_p_p[ijk] + dt2 * (dxxu + dyyu + dzzu);  
 
  u[ijk] = uval;  
 
} LC_ENDLOOP3(evol);

The last argument kernel is used to store the compiled kernel and associated information, so that kernels do not have to be recompiled for every call.

In this case, the actual kernel source code is contained in a source file evol.cl in thorn WaveToyOpenCL. The C string OpenCL_source_WaveToyOpenCL_evol is generated automatically by Cactus as described in the users’ guide and/or thorn OpenCL.

3 Details

3.1 Hardware information

At startup, this thorn outputs a description of the hardware (platforms and devices) available via OpenCL, i.e. CPUs and GPUs that have OpenCL drivers installed. Platforms correspond to vendors (AMD, Apple, Intel, Nvidia), devices to actual hardware (a CPU, a GPU, etc.). This information is written into a file opencl.txt in the output directory.

3.2 Device selection

At startup, this thorn selects one device of one platform that will be used later on. It chooses the first device of the first platform that matches the parameter opencl_device_type, which can be CPU, GPU, accelerator, or any.

3.3 Compiling kernels

This thorn provides an API that can be used to compile kernels for this device, and which remembers previously compiled kernels so that they don’t have to be recompiled. The compiler options for OpenCL are specified by the parameter opencl_options and enable aggressive optimisations by default, as one would want for floating-point intensive code that is not too susceptible to round-off errors.

Cactus parameter values are expanded at compile time, enabling further optimisations. (However, when a parameter value changes, the kernel is not automatically recompiled – steerable parameters are not yet supported. This would be straightforward to implement.)

Typically, OpenCL compilers can optimise more than e.g. C or Fortran compilers. The reason is that an OpenCL compiler knows the complete code – it is not possible to call routines that are defined elsewhere, or to be influenced by changes originating elsewhere (e.g. in another thread). Unfortunately, this does not mean that all OpenCL compilers are good at optimising – OpenCL is a fairly young language, and some of the technology is still immature.

A file containing the exact source code passed to the compiler is placed into the output directory under the name KERNELNAME.cl. A log file containing all compiler output, including error messages, is placed there under the name KERNELNAME.log. Both are indispensable for debugging OpenCL code.

3.4 Disassembling kernels

This thorn disassembles the compiled kernels so that they can be examined (which would otherwise be difficult in an environment using dynamic compilation). The disassembled output is placed into the output directory under the name KERNELNAME.s, if disassembling is supported and makes sense. (For example, object files for Nvidia GPUs contain PTX, which is essentially high-level assembler code and thus does not need to be disassembled.)

By default, kernels are disassembled in the background.

3.5 Memory management

This thorn allocates storage for grid functions on the device, and handles copying data to and from the device (in interaction with thorn Accelerator).

OpenCL devices have memory that is independent of the host memory. This is the case even when using CPUs – a particular memory region cannot be accessed by host code and by OpenCL kernels at the same time. This thorn offers several memory models (memory allocation strategies):

always-mapped:
Host and device access the memory simultaneously. This may work, but it violates the OpenCL standard. Do not use this model unless you know that your OpenCL implementation supports it; if it does not, some values in memory will change seemingly at random.
copy:
Host and device memory are allocated independently. Data will be copied. This makes sense e.g. for GPUs that have their own memory. This model also allows memory layout optimisations such as aligning grid functions with vector sizes or cache lines. Such layout optimisations are currently not supported by the Cactus flesh (but work is in progress to implement them there).
map:
Device memory is allocated such that it (likely) coincides with the memory already allocated on the host. However, either only the host or only the device can access this memory at a time; the OpenCL run-time needs to be notified to switch between these. This memory model will save space, but may be slower if host memory cannot efficiently be accessed from the device. This memory model is also not yet fully tested.

Routines may execute either on the host (regular routines) or on a device (OpenCL routines). Variables accessed (read or written) by routines may need to be copied between host and device. Thorn Accelerator keeps track of this, and notifies thorn OpenCLRunTime when data need to be copied.

Data also need to be available on the host for inter-processor synchronisation and I/O.

The parameter sync_copy_whole_buffer determines whether the whole grid function or only values on/near the boundary are copied for synchronisation.

3.6 Grid structure

This thorn offers a set of convenience macros similar to CCTK_ARGUMENTS and CCTK_PARAMETERS to access grid structure information. Currently, only a subset of the information in cctkGH is available:

  ptrdiff_t cctk_lbnd[]  
  ptrdiff_t cctk_lsh[]  
  ptrdiff_t imin[]  
  ptrdiff_t imax[]  
  CCTK_REAL cctk_time  
  CCTK_REAL cctk_delta_time  
  CCTK_REAL cctk_origin_space[]  
  CCTK_REAL cctk_delta_space[]  
  CCTK_DELTA_TIME  
  CCTK_ORIGIN_SPACE()  
  CCTK_DELTA_SPACE()  
  CCTK_GFINDEX3D()

cctk_lbnd and cctk_lsh have the same meaning as on the host. imin and imax contain the values specified when calling OpenCLRunTime_CallKernel, and determine the loop bounds used in this kernel. The real-valued variables and their macro counterparts have the same meaning as on the host. The type of the integer fields has been changed from int to ptrdiff_t, which is a 64-bit type on 64-bit platforms, and leads to more efficient code since it avoids type conversions.

3.7 Loops

This thorn offers looping macros similar to those provided by thorn LoopControl, which parallelise loops via OpenCL’s multithreading.

The loop macros LC_LOOP3 and LC_ENDLOOP3 should be called as in the example above. The first argument defines a name for the loop, the next three arguments define the names of the iteration indices. The remaining arguments describe the loop bounds and the grid function size.

These macros must be used: each OpenCL thread loops only over a part of the region described by imin and imax. Without them, OpenCL’s multithreading may be used in an inconsistent manner (unless you use OpenCL’s API to distribute the workload yourself).