## OpenCLRunTime

May 17, 2012

### Abstract

Executing OpenCL kernels requires some boilerplate code: one needs to choose an OpenCL platform and device, compile the code (from a C string), pass in arguments, and finally execute the actual kernel code. Thorn OpenCLRunTime provides a simple helper routine for these tasks.

### 1 Overview

Thorn OpenCLRunTime performs the following tasks:

• At startup, it outputs a description of the hardware (platforms and devices) available via OpenCL, i.e. CPUs and GPUs that have OpenCL drivers installed.

• At startup, it selects one device of one platform that will be used later on.

• It provides an API that can be used to compile kernels for this device, and which remembers previously compiled kernels so that they don’t have to be recompiled.

• It disassembles the compiled kernels so that they can be examined (which would otherwise be difficult in an environment using dynamic compilation).

• It allocates storage for grid functions on the device, and handles copying data to and from the device (in interaction with thorn Accelerator).

• It offers a set of convenience macros similar to CCTK_ARGUMENTS and CCTK_PARAMETERS to access grid structure information. (Parameter values are expanded when compiling kernels, often enabling additional optimisations.)

• It offers looping macros similar to those provided by thorn LoopControl, which parallelise loops via OpenCL’s multithreading.

• It offers datatypes and macros for easy vectorisation, based on OpenCL’s vector types.

At the moment, OpenCLRunTime only supports unigrid simulations; adaptive mesh refinement or multi-block methods are not yet supported. (The main reason for this is that the cctkGH entries on the device are not updated; however, this should be straightforward to implement.)

### 2 Example

An OpenCL compute kernel may be called as follows:

    char const *const groups[] = {
      "WaveToyOpenCL::Scalar",
      NULL};

    int const imin[] = {cctk_nghostzones[0],
                        cctk_nghostzones[1],
                        cctk_nghostzones[2]};
    int const imax[] = {cctk_lsh[0] - cctk_nghostzones[0],
                        cctk_lsh[1] - cctk_nghostzones[1],
                        cctk_lsh[2] - cctk_nghostzones[2]};

    static struct OpenCLKernel *kernel = NULL;
    char const *const sources[] = {"", OpenCL_source_WaveToyOpenCL_evol, NULL};
    OpenCLRunTime_CallKernel(cctkGH, CCTK_THORNSTRING, "evol",
                             sources, groups, NULL, NULL, NULL, -1,
                             imin, imax, &kernel);


The array groups specifies which grid functions are to be available in the OpenCL kernel. This is a C array terminated by NULL. (This information could instead also be gathered from the respective schedule.ccl declarations.)

The integer arrays imin and imax specify the iteration bounds of the kernel. This information is necessary so that OpenCL can properly map this iteration space onto the available OpenCL groups (threads).
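The bounds used in the call above exclude the ghost zones on every face of the local grid. The following is a minimal sketch of that calculation in plain C. The helper names `demo_interior_bounds` and `demo_num_points` are hypothetical and not part of this thorn, and the sketch assumes (as the example suggests) that imin is an inclusive lower bound and imax an exclusive upper bound.

```c
#include <assert.h>

/* Hypothetical helper (not part of OpenCLRunTime): compute loop bounds
   that cover the interior of a 3D grid, skipping the ghost zones.
   Assumption: imin is inclusive, imax is exclusive. */
static void demo_interior_bounds(const int lsh[3], const int nghostzones[3],
                                 int imin[3], int imax[3])
{
  for (int d = 0; d < 3; ++d) {
    /* The grid must be large enough to hold both ghost layers */
    assert(lsh[d] >= 2 * nghostzones[d]);
    imin[d] = nghostzones[d];
    imax[d] = lsh[d] - nghostzones[d];
  }
}

/* Number of grid points visited by a loop over [imin, imax) */
static long demo_num_points(const int imin[3], const int imax[3])
{
  long n = 1;
  for (int d = 0; d < 3; ++d)
    n *= (long)(imax[d] - imin[d]);
  return n;
}
```

For a local grid of 10 x 12 x 14 points with 1, 2 and 3 ghost zones per direction, this yields an interior of 8 x 8 x 8 points.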

The array sources (a C array terminated by NULL) specifies the actual source code for the kernel. The first string (here empty) can contain declarations and definitions that should be available outside the kernel function. The second string specifies the actual kernel code, excluding the function declaration, which is inserted automatically. This is an example of such kernel code:

    // Grid points are indexed in the same way as for a CPU
    // Using ptrdiff_t instead of int is more efficient on 64-bit
    // architectures
    ptrdiff_t const di = 1;
    ptrdiff_t const dj =
      CCTK_GFINDEX3D(cctkGH,0,1,0) - CCTK_GFINDEX3D(cctkGH,0,0,0);
    ptrdiff_t const dk =
      CCTK_GFINDEX3D(cctkGH,0,0,1) - CCTK_GFINDEX3D(cctkGH,0,0,0);

    // Coordinates are calculated in the same way as for a CPU
    CCTK_REAL const idx2 = 1.0 / pown(CCTK_DELTA_SPACE(0), 2);
    CCTK_REAL const idy2 = 1.0 / pown(CCTK_DELTA_SPACE(1), 2);
    CCTK_REAL const idz2 = 1.0 / pown(CCTK_DELTA_SPACE(2), 2);
    CCTK_REAL const dt2  = pown(CCTK_DELTA_TIME, 2);

    // Note: The kernel below is not vectorised (since it doesn’t use
    // CCTK_REAL_VEC). Therefore, vectorisation must be switched off in
    // the parameter file (via OpenCLRunTime::vector_size_x = 1).

    // This loop macro automatically parallelises the code
    // imin[] and imax[] are passed from the host
    LC_LOOP3(evol,
             i,j,k, imin[0],imin[1],imin[2], imax[0],imax[1],imax[2],
             cctk_lsh[0],cctk_lsh[1],cctk_lsh[2])
    {
      // Calculate index of current point
      ptrdiff_t const ijk = di*i + dj*j + dk*k;

      CCTK_REAL const dxxu = idx2 * (u_p[ijk-di] - 2.0 * u_p[ijk] + u_p[ijk+di]);
      CCTK_REAL const dyyu = idy2 * (u_p[ijk-dj] - 2.0 * u_p[ijk] + u_p[ijk+dj]);
      CCTK_REAL const dzzu = idz2 * (u_p[ijk-dk] - 2.0 * u_p[ijk] + u_p[ijk+dk]);

      CCTK_REAL const uval =
        +2.0 * u_p[ijk] - u_p_p[ijk] + dt2 * (dxxu + dyyu + dzzu);

      u[ijk] = uval;

    } LC_ENDLOOP3(evol);


The last argument kernel is used to store the compiled kernel and associated information, so that kernels do not have to be recompiled for every call.
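The caching pattern behind this argument can be sketched in plain C as follows. Everything in this sketch is a hypothetical stand-in: `struct DemoKernel`, `demo_call_kernel` and `demo_evol_step` only mimic the way a static handle is passed by address, and nothing here reflects the thorn's internal data structures.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the opaque kernel handle */
struct DemoKernel {
  int id;
};

/* Counts how often the (pretend) compiler has been invoked */
static int demo_compilations = 0;

/* Hypothetical stand-in for a call such as OpenCLRunTime_CallKernel:
   builds the kernel on the first call only, then reuses the handle. */
static void demo_call_kernel(struct DemoKernel **pkernel)
{
  assert(pkernel != NULL);
  if (*pkernel == NULL) {
    /* First call: "compile" the kernel and remember the result */
    *pkernel = malloc(sizeof **pkernel);
    assert(*pkernel != NULL);
    (*pkernel)->id = demo_compilations;
    ++demo_compilations;
  }
  /* ... the already compiled kernel would be executed here ... */
}

/* A scheduled routine keeps the handle in a static variable, so the
   handle survives from one call (time step) to the next. */
static void demo_evol_step(void)
{
  static struct DemoKernel *kernel = NULL;
  demo_call_kernel(&kernel);
}
```

Calling `demo_evol_step` repeatedly triggers only one compilation, which is the reason the handle in the example is declared `static`.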

In this case, the actual kernel source code is contained in a source file evol.cl in thorn WaveToyOpenCL. The C string OpenCL_source_WaveToyOpenCL_evol is generated automatically by Cactus as described in the users’ guide and/or thorn OpenCL.

### 3 Details

#### 3.1 Hardware information

At startup, this thorn outputs a description of the hardware (platforms and devices) available via OpenCL, i.e. CPUs and GPUs that have OpenCL drivers installed. Platforms correspond to vendors (AMD, Apple, Intel, Nvidia), devices to actual hardware (a CPU, a GPU, etc.). This information is written into a file opencl.txt in the output directory.

#### 3.2 Device selection

At startup, this thorn selects one device of one platform that will be used later on. It chooses the first device of the first platform that matches the parameter opencl_device_type, which can be CPU, GPU, accelerator, or any.
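The first-match rule can be illustrated with a small sketch that searches a mock list of platforms and devices. The types and functions below (`demo_platform`, `demo_device`, `demo_matches`, `demo_select`) are hypothetical; the thorn itself queries the OpenCL run-time, which this sketch deliberately avoids so that it stays self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical device categories, mirroring the parameter values */
enum demo_type { DEMO_CPU, DEMO_GPU, DEMO_ACCELERATOR };

struct demo_device { enum demo_type type; };
struct demo_platform {
  size_t ndevices;
  const struct demo_device *devices;
};

/* Returns 1 if a device of the given type satisfies the request
   ("CPU", "GPU", "accelerator" or "any"), and 0 otherwise */
static int demo_matches(enum demo_type type, const char *want)
{
  if (strcmp(want, "any") == 0) return 1;
  if (strcmp(want, "CPU") == 0) return type == DEMO_CPU;
  if (strcmp(want, "GPU") == 0) return type == DEMO_GPU;
  if (strcmp(want, "accelerator") == 0) return type == DEMO_ACCELERATOR;
  return 0;
}

/* First-match search: platforms in order, then their devices in order.
   Returns 1 and sets *p and *d on success, 0 if nothing matches. */
static int demo_select(const struct demo_platform *platforms,
                       size_t nplatforms, const char *want,
                       size_t *p, size_t *d)
{
  assert(platforms != NULL && want != NULL && p != NULL && d != NULL);
  for (size_t i = 0; i < nplatforms; ++i)
    for (size_t j = 0; j < platforms[i].ndevices; ++j)
      if (demo_matches(platforms[i].devices[j].type, want)) {
        *p = i;
        *d = j;
        return 1;
      }
  return 0;
}
```

With a first platform that offers only a CPU and a second platform that offers a CPU and a GPU, a request for "GPU" selects the second device of the second platform, while "any" selects the very first device.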

#### 3.3 Compiling kernels

This thorn provides an API that can be used to compile kernels for this device, and which remembers previously compiled kernels so that they don’t have to be recompiled. The compiler options for OpenCL are specified by the parameter opencl_options and enable aggressive optimisations by default, as one would want for floating-point intensive code that is not too susceptible to round-off errors.

Cactus parameter values are expanded at compile time, enabling further optimisations. (However, when a parameter value changes, the kernel is not automatically recompiled – steerable parameters are not yet supported. This would be straightforward to implement.)
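One plausible way to expand parameter values at compile time is to prepend preprocessor definitions to the kernel source, so that the OpenCL compiler sees constants instead of variables. The sketch below illustrates this idea only; the function `demo_append_define` and the parameter name `amplitude` are hypothetical, and the text above does not state which mechanism the thorn actually uses.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: append "#define <name> <value>\n" to the
   NUL-terminated string in buf (capacity: size bytes).
   Returns 0 on success, or -1 if the result would not fit. */
static int demo_append_define(char *buf, size_t size,
                              const char *name, double value)
{
  assert(buf != NULL && name != NULL);
  size_t const used = strlen(buf);
  assert(used < size);
  int const n = snprintf(buf + used, size - used,
                         "#define %s %.17g\n", name, value);
  if (n < 0 || (size_t)n >= size - used) {
    buf[used] = '\0'; /* undo a partial write */
    return -1;
  }
  return 0;
}
```

A preamble built this way would be passed to the compiler ahead of the kernel source. Because the values are fixed in the compiled code, a changed parameter value requires a recompilation, which matches the restriction on steerable parameters described above.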

Typically, OpenCL compilers can optimise more than e.g. C or Fortran compilers. The reason is that an OpenCL compiler knows the complete code: it is not possible to call routines that are defined elsewhere, or to be influenced by changes originating elsewhere (e.g. in another thread). Unfortunately, this does not mean that all OpenCL compilers are good at optimising; OpenCL is a fairly young language, and some of the technology is still immature.

A file containing the exact source code passed to the compiler is placed into the output directory with a name KERNELNAME.cl. A log file containing all compiler output including error messages is placed into the output directory with a name KERNELNAME.log. Both are indispensable for debugging OpenCL code.

#### 3.4 Disassembling kernels

This thorn disassembles the compiled kernels so that they can be examined (which would otherwise be difficult in an environment using dynamic compilation). The disassembled output is placed into the output directory with a name KERNELNAME.s, if disassembling is supported and makes sense. (For example, object files for Nvidia GPUs contain PTX, which is essentially a high-level assembler code, and are thus not disassembled.)

By default, kernels are disassembled in the background.

#### 3.5 Memory management

This thorn allocates storage for grid functions on the device, and handles copying data to and from the device (in interaction with thorn Accelerator).

OpenCL devices have memory that is independent of the host memory. This is the case even when using CPUs – a particular memory region cannot be accessed by host code and by OpenCL kernels at the same time. This thorn offers several memory models (memory allocation strategies):

always-mapped:

Host and device access the memory simultaneously. This may work, but violates the OpenCL standard. Do not use this, unless you know that your implementation supports this. If this does not work, some values in memory will randomly change.

copy:

Host and device memory are allocated independently. Data will be copied. This makes sense e.g. for GPUs that have their own memory. This model also allows memory layout optimisation such as aligning grid functions with vector sizes or cache lines. Such layout optimisations are currently not supported by the Cactus flesh (but work is in progress to implement this there).

map:

Device memory is allocated such that it (likely) coincides with the memory already allocated on the host. However, either only the host or only the device can access this memory at a time; the OpenCL run-time needs to be notified to switch between these. This memory model will save space, but may be slower if host memory cannot efficiently be accessed from the device. This memory model is also not yet fully tested.

Routines may execute either on the host (regular routines) or on a device (OpenCL routines). Variables accessed (read or written) by routines may need to be copied between host and device. Thorn Accelerator keeps track of this, and notifies thorn OpenCLRunTime when data need to be copied.

Data also need to be available on the host for inter-processor synchronisation and I/O.

The parameter sync_copy_whole_buffer determines whether the whole grid function or only values on/near the boundary are copied for synchronisation.

#### 3.6 Grid structure

This thorn offers a set of convenience macros similar to CCTK_ARGUMENTS and CCTK_PARAMETERS to access grid structure information. Currently, only a subset of the information in cctkGH is available:

    ptrdiff_t cctk_lbnd[]
    ptrdiff_t cctk_lsh[]
    ptrdiff_t imin[]
    ptrdiff_t imax[]
    CCTK_REAL cctk_time
    CCTK_REAL cctk_delta_time
    CCTK_REAL cctk_origin_space[]
    CCTK_REAL cctk_delta_space[]
    CCTK_DELTA_TIME
    CCTK_ORIGIN_SPACE()
    CCTK_DELTA_SPACE()
    CCTK_GFINDEX3D()


cctk_lbnd and cctk_lsh have the same meaning as on the host. imin and imax contain the values specified when calling OpenCLRunTime_CallKernel, and determine the loop bounds used in this kernel. The real-valued variables and their macro counterparts have the same meaning as on the host. The type of the integer fields has been changed from int to ptrdiff_t, which is a 64-bit type on 64-bit platforms, and leads to more efficient code since it avoids type conversions.
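The strides di, dj and dk computed in the example kernel follow from the way a 3D index is linearised. The sketch below shows this with ptrdiff_t arithmetic throughout. The function `demo_gfindex3d` is a hypothetical stand-in for CCTK_GFINDEX3D and assumes an unpadded layout in which the x index varies fastest.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for CCTK_GFINDEX3D: linear index of the point
   (i,j,k) on a grid with local shape lsh. Assumption: no padding, and
   the x index varies fastest. */
static ptrdiff_t demo_gfindex3d(const ptrdiff_t lsh[3],
                                ptrdiff_t i, ptrdiff_t j, ptrdiff_t k)
{
  assert(i >= 0 && i < lsh[0]);
  assert(j >= 0 && j < lsh[1]);
  assert(k >= 0 && k < lsh[2]);
  return i + lsh[0] * (j + lsh[1] * k);
}
```

On a grid of 10 x 12 x 14 points this gives the strides di = 1, dj = 10 and dk = 120, so that the neighbours of a point with linear index ijk are found at ijk ± di, ijk ± dj and ijk ± dk.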

#### 3.7 Loops

This thorn offers looping macros similar to those provided by thorn LoopControl, which parallelise loops via OpenCL’s multithreading.

The loop macros LC_LOOP3 and LC_ENDLOOP3 should be called as in the example above. The first argument defines a name for the loop, the next three arguments define the names of the iteration indices. The remaining arguments describe the loop bounds and the grid function size.

These macros must be used. Each OpenCL thread will loop only over a part of the region described by imin and imax. If these macros are not used, OpenCL’s multithreading may be used in an inconsistent manner (unless you use OpenCL’s API to distribute the workload yourself).
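How a loop region can be divided so that each thread handles only its own part is sketched below for one dimension. The function `demo_thread_range` is hypothetical and uses a simple block decomposition; the actual macros are additionally controlled by the group, tile, unroll and vector size parameters, which this sketch does not model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: split the range [imin, imax) into contiguous
   blocks, one per thread. Thread `tid` receives [*lo, *hi), which may
   be empty if there are more threads than points. */
static void demo_thread_range(ptrdiff_t imin, ptrdiff_t imax,
                              ptrdiff_t nthreads, ptrdiff_t tid,
                              ptrdiff_t *lo, ptrdiff_t *hi)
{
  assert(lo != NULL && hi != NULL);
  assert(imin <= imax);
  assert(nthreads > 0 && tid >= 0 && tid < nthreads);
  ptrdiff_t const n = imax - imin;
  /* Block size, rounded up so that all points are covered */
  ptrdiff_t const chunk = (n + nthreads - 1) / nthreads;
  *lo = imin + tid * chunk;
  *hi = *lo + chunk;
  /* Clamp blocks that would extend past the end of the range */
  if (*lo > imax) *lo = imax;
  if (*hi > imax) *hi = imax;
}
```

Splitting the range [1, 11) among four threads yields the blocks [1, 4), [4, 7), [7, 10) and [10, 11). Every point is visited exactly once, which is the property that is lost if a kernel ignores the thread index and loops over the whole region itself.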

#### 3.8 Vectorisation

OpenCL supports vector data types. Using such vector data types is important to achieve good performance on CPUs. This thorn provides macros, in particular CCTK_REAL_VEC, that can be used for this. Unfortunately, vectorisation has to be performed explicitly by the kernel writer, and is not performed by this thorn. (However, note that some OpenCL compilers can vectorise code automatically.)

When vectorising code explicitly, one needs to use special instructions to load and store values from and to memory. This is not (yet) described here; however, the macros are similar to those offered by thorn Vectors. At the moment, these vectorisation capabilities are targeted for automated code generation (e.g. by Kranc) rather than for manual programming.
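The general shape of explicitly vectorised code can be illustrated in plain C by emulating a two-element vector type. All names below (`demo_vec`, `demo_vec_load`, `demo_vec_store`, `demo_vec_add`, `demo_add_arrays`) are hypothetical. They are not the macros provided by this thorn or by thorn Vectors, and a real kernel would use OpenCL’s vector types instead of a struct.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_VEC_SIZE 2 /* hypothetical vector width */

/* Plain-C stand-in for a vector type holding DEMO_VEC_SIZE values */
typedef struct {
  double v[DEMO_VEC_SIZE];
} demo_vec;

/* Load DEMO_VEC_SIZE consecutive values starting at p */
static demo_vec demo_vec_load(const double *p)
{
  demo_vec x;
  for (int n = 0; n < DEMO_VEC_SIZE; ++n)
    x.v[n] = p[n];
  return x;
}

/* Store DEMO_VEC_SIZE consecutive values starting at p */
static void demo_vec_store(double *p, demo_vec x)
{
  for (int n = 0; n < DEMO_VEC_SIZE; ++n)
    p[n] = x.v[n];
}

/* Element-wise addition */
static demo_vec demo_vec_add(demo_vec a, demo_vec b)
{
  demo_vec r;
  for (int n = 0; n < DEMO_VEC_SIZE; ++n)
    r.v[n] = a.v[n] + b.v[n];
  return r;
}

/* c[i] = a[i] + b[i], processing DEMO_VEC_SIZE points per iteration.
   npoints must be a multiple of the vector size. */
static void demo_add_arrays(double *c, const double *a, const double *b,
                            size_t npoints)
{
  assert(npoints % DEMO_VEC_SIZE == 0);
  for (size_t i = 0; i < npoints; i += DEMO_VEC_SIZE)
    demo_vec_store(&c[i],
                   demo_vec_add(demo_vec_load(&a[i]), demo_vec_load(&b[i])));
}
```

The loop advances by the vector size, and every memory access goes through an explicit load or store helper. This is the main structural difference from the scalar kernel shown in the example section.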

### 4 Parameters

| Parameter | Scope | Type | Description | Default | Range |
|---|---|---|---|---|---|
| disassemble_in_background | private | BOOLEAN | Disassemble in the background (using fork) | yes | |
| disassemble_kernels | private | BOOLEAN | Disassemble kernels | yes | |
| group_size_x | private | INT | Group size in x direction | 1 | `1:*` |
| group_size_y | private | INT | Group size in y direction | 1 | `1:*` |
| group_size_z | private | INT | Group size in z direction | 1 | `1:*` |
| memory_model | private | KEYWORD | Memory model | copy | always-mapped: CPU and GPU use the same memory (may violate standard); copy: Copy buffers; map: Map buffers (requires same layout) |
| opencl_device_type | private | KEYWORD | Device type | CPU | CPU, GPU, accelerator, any |
| opencl_options | private | STRING | OpenCL compiler options | `-cl-mad-enable -cl-no-signed-zeros -cl-fast-relaxed-math` | |
| sync_copy_whole_buffer | private | BOOLEAN | Copy whole buffer before/after syncing | no | |
| tile_size_x | private | INT | Tile size in x direction | 1 | `1:*` |
| tile_size_y | private | INT | Tile size in y direction | 1 | `1:*` |
| tile_size_z | private | INT | Tile size in z direction | 1 | `1:*` |
| unroll_size_x | private | INT | Unroll size in x direction | 1 | `1:*` |
| unroll_size_y | private | INT | Unroll size in y direction | 1 | `1:*` |
| unroll_size_z | private | INT | Unroll size in z direction | 1 | `1:*` |
| vector_size_x | private | INT | Vector size in x direction | (none) | use preferred vector size; `1:*` |
| vector_size_y | private | INT | Vector size in y direction | 1 | 1 |
| vector_size_z | private | INT | Vector size in z direction | 1 | 1 |
| verbose | private | BOOLEAN | Output detailed device information | no | |
| veryverbose | private | BOOLEAN | Output even more detailed information | no | |
| out_dir | shared from IO | STRING | | | |

### 5 Interfaces

#### General

Implements:

openclruntime

OpenCLRunTime.h

carpet.hh

vectors.h

Provides:

Device_CreateVariables to

Device_CopyCycle to

Device_CopyFromPast to

Device_CopyToDevice to

Device_CopyToHost to

Device_CopyPreSync to

Device_CopyPostSync to

LinearCombination to

### 6 Schedule

This section lists all the variables which are assigned storage by thorn CactusUtils/OpenCLRunTime. Storage can either last for the duration of the run (Always means that if this thorn is activated storage will be assigned, Conditional means that if this thorn is activated storage will be assigned for the duration of the run if some condition is met), or can be turned on for the duration of a schedule function.

NONE

#### Scheduled Functions

| Schedule bin | Function | Description | Ordering | Options | Language | Type |
|---|---|---|---|---|---|---|
| CCTK_STARTUP | openclruntime_setup | set up opencl device | | | c | function |
| CCTK_WRAGH | openclruntime_deviceinfo | output opencl device information | | | c | function |
| CCTK_WRAGH | openclruntime_autoconf | determine whether certain features are supported | After: openclruntime_deviceinfo | | c | function |
| CCTK_BASEGRID | openclruntime_setupdevicegh | set up device grid structure | After: spatialspacings temporalspacings; Before: spatialcoordinates | | c | function |
| CCTK_TERMINATE | openclruntime_statistics | output profiling information | | global | c | function |