Executing OpenCL kernels requires some boilerplate code: one needs to choose an OpenCL platform and device, compile the kernel code (from a C string), pass in arguments, and finally execute the actual kernel. Thorn OpenCLRunTime provides a simple helper routine for these tasks.
Thorn OpenCLRunTime performs the following tasks:
At startup, it outputs a description of the hardware (platforms and devices) available via OpenCL, i.e. CPUs and GPUs that have OpenCL drivers installed.
At startup, it selects one device of one platform that will be used later on.
It provides an API that can be used to compile kernels for this device, and which remembers previously compiled kernels so that they don’t have to be recompiled.
It disassembles the compiled kernels so that they can be examined (which would otherwise be difficult in an environment using dynamic compilation).
It allocates storage for grid functions on the device, and handles copying data to and from the device (in interaction with thorn Accelerator).
It offers a set of convenience macros similar to CCTK_ARGUMENTS and CCTK_PARAMETERS to access grid structure information. (Parameter values are expanded when compiling kernels, often enabling additional optimisations.)
It offers looping macros similar to those provided by thorn LoopControl, which parallelise loops via OpenCL’s multithreading.
It offers datatypes and macros for easy vectorisation, based on OpenCL’s vector types.
At the moment, OpenCLRunTime supports only unigrid simulations; adaptive mesh refinement and multi-block methods are not yet supported. (The main reason for this is that the cctkGH entries on the device are not updated; however, this should be straightforward to implement.)
An OpenCL compute kernel may be called as follows:
    char const *const groups[] = {"WaveToyOpenCL::Scalar", NULL};
    int const imin[] = {cctk_nghostzones[0],
                        cctk_nghostzones[1],
                        cctk_nghostzones[2]};
    int const imax[] = {cctk_lsh[0] - cctk_nghostzones[0],
                        cctk_lsh[1] - cctk_nghostzones[1],
                        cctk_lsh[2] - cctk_nghostzones[2]};
    static struct OpenCLKernel *kernel = NULL;
    char const *const sources[] = {"", OpenCL_source_WaveToyOpenCL_evol, NULL};
    OpenCLRunTime_CallKernel(cctkGH, CCTK_THORNSTRING, "evol",
                             sources, groups, NULL, NULL, NULL, -1,
                             imin, imax, &kernel);
The array groups specifies which grid functions are to be available in the OpenCL kernel. This is a C array terminated by NULL. (This information could instead also be gathered from the respective schedule.ccl declarations.)
The integer arrays imin and imax specify the iteration bounds of the kernel. This information is necessary so that OpenCL can properly map this iteration space onto the available OpenCL groups (threads).
The array sources (a C array terminated by NULL) specifies the actual source code for the kernel. The first string (here empty) can contain declarations and definitions that should be available outside the kernel function. The second string specifies the actual kernel code, excluding the function declaration, which is inserted automatically. The following is an example of such kernel code:
    // Grid points are indexed in the same way as for a CPU
    // Using ptrdiff_t instead of int is more efficient on 64-bit
    // architectures
    ptrdiff_t const di = 1;
    ptrdiff_t const dj =
      CCTK_GFINDEX3D(cctkGH,0,1,0) - CCTK_GFINDEX3D(cctkGH,0,0,0);
    ptrdiff_t const dk =
      CCTK_GFINDEX3D(cctkGH,0,0,1) - CCTK_GFINDEX3D(cctkGH,0,0,0);

    // Coordinates are calculated in the same way as for a CPU
    CCTK_REAL const idx2 = 1.0 / pown(CCTK_DELTA_SPACE(0), 2);
    CCTK_REAL const idy2 = 1.0 / pown(CCTK_DELTA_SPACE(1), 2);
    CCTK_REAL const idz2 = 1.0 / pown(CCTK_DELTA_SPACE(2), 2);
    CCTK_REAL const dt2  = pown(CCTK_DELTA_TIME, 2);

    // Note: The kernel below is not vectorised (since it doesn't use
    // CCTK_REAL_VEC). Therefore, vectorisation must be switched off in
    // the parameter file (via OpenCLRunTime::vector_size_x = 1).

    // This loop macro automatically parallelises the code
    // imin[] and imax[] are passed from the host
    LC_LOOP3(evol, i,j,k,
             imin[0],imin[1],imin[2],
             imax[0],imax[1],imax[2],
             cctk_lsh[0],cctk_lsh[1],cctk_lsh[2])
    {
      // Calculate index of current point
      ptrdiff_t const ijk = di*i + dj*j + dk*k;

      CCTK_REAL const dxxu = idx2 * (u_p[ijk-di] - 2.0*u_p[ijk] + u_p[ijk+di]);
      CCTK_REAL const dyyu = idy2 * (u_p[ijk-dj] - 2.0*u_p[ijk] + u_p[ijk+dj]);
      CCTK_REAL const dzzu = idz2 * (u_p[ijk-dk] - 2.0*u_p[ijk] + u_p[ijk+dk]);

      CCTK_REAL const uval =
        +2.0*u_p[ijk] - u_p_p[ijk] + dt2 * (dxxu + dyyu + dzzu);

      u[ijk] = uval;
    } LC_ENDLOOP3(evol);
The last argument kernel is used to store the compiled kernel and associated information, so that kernels do not have to be recompiled for every call.
In this case, the actual kernel source code is contained in a source file evol.cl in thorn WaveToyOpenCL. The C string OpenCL_source_WaveToyOpenCL_evol is generated automatically by Cactus as described in the users’ guide and/or thorn OpenCL.
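For context, the fragment above typically lives inside an ordinary scheduled Cactus routine. The following sketch shows such a wrapper; the routine name WaveToyOpenCL_Evol and the exact set of included headers are assumptions made for illustration, not taken from thorn WaveToyOpenCL.

    /* Sketch of a host-side scheduled routine wrapping the kernel call.
       The routine name and includes are illustrative assumptions.
       OpenCL_source_WaveToyOpenCL_evol is declared in code generated
       automatically by Cactus, as described above. */
    #include "cctk.h"
    #include "cctk_Arguments.h"
    #include "cctk_Parameters.h"
    #include "OpenCLRunTime.h"

    void WaveToyOpenCL_Evol(CCTK_ARGUMENTS)
    {
      DECLARE_CCTK_ARGUMENTS;
      DECLARE_CCTK_PARAMETERS;

      char const *const groups[] = {"WaveToyOpenCL::Scalar", NULL};
      int const imin[] = {cctk_nghostzones[0], cctk_nghostzones[1],
                          cctk_nghostzones[2]};
      int const imax[] = {cctk_lsh[0] - cctk_nghostzones[0],
                          cctk_lsh[1] - cctk_nghostzones[1],
                          cctk_lsh[2] - cctk_nghostzones[2]};
      static struct OpenCLKernel *kernel = NULL;
      char const *const sources[] =
        {"", OpenCL_source_WaveToyOpenCL_evol, NULL};

      /* Compile (on first call) and execute the kernel */
      OpenCLRunTime_CallKernel(cctkGH, CCTK_THORNSTRING, "evol",
                               sources, groups, NULL, NULL, NULL, -1,
                               imin, imax, &kernel);
    }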
At startup, this thorn outputs a description of the hardware (platforms and devices) available via OpenCL, i.e. CPUs and GPUs that have OpenCL drivers installed. Platforms correspond to vendors (AMD, Apple, Intel, Nvidia), devices to actual hardware (a CPU, a GPU, etc.). This information is written into a file opencl.txt in the output directory.
At startup, this thorn selects one device of one platform that will be used later on. It chooses the first device of the first platform that matches the parameter opencl_device_type, which can be CPU, GPU, accelerator, or any.
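The selection rule can be pictured in terms of plain OpenCL API calls. The sketch below is illustrative only and is not the thorn's actual implementation; it merely mirrors the "first matching device of the first platform" behaviour.

    /* Illustration only (not the thorn's code): pick the first device of
       the first platform that matches the requested cl_device_type. */
    #include <stddef.h>
    #include <CL/cl.h>

    static cl_device_id select_device(cl_device_type wanted)
    {
      cl_platform_id platforms[16];
      cl_uint nplatforms = 0;
      clGetPlatformIDs(16, platforms, &nplatforms);

      for (cl_uint p = 0; p < nplatforms; ++p) {
        cl_device_id device;
        cl_uint ndevices = 0;
        /* Ask for a single device of the wanted type on this platform */
        if (clGetDeviceIDs(platforms[p], wanted, 1, &device, &ndevices) ==
                CL_SUCCESS &&
            ndevices > 0)
          return device;        /* first match wins */
      }
      return NULL;              /* no matching device found */
    }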
This thorn provides an API that can be used to compile kernels for this device, and which remembers previously compiled kernels so that they don't have to be recompiled. The compiler options for OpenCL are specified by the parameter opencl_options and enable aggressive optimisations by default, as one would want for floating-point-intensive code that is not too susceptible to round-off errors.
Cactus parameter values are expanded at compile time, enabling further optimisations. (However, when a parameter value changes, the kernel is not automatically recompiled – steerable parameters are not yet supported. This would be straightforward to implement.)
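As an illustration, a kernel can use a thorn parameter as if it were a compile-time constant. The CCTK_REAL parameter amplitude below is made up for this example (the grid functions are those of the wave equation example above); the point is that the parameter's current value is substituted into the source before the OpenCL compiler runs.

    // Kernel fragment: "amplitude" is a hypothetical CCTK_REAL parameter.
    // Its value is expanded textually when the kernel is compiled, so the
    // OpenCL compiler sees a literal constant and can optimise accordingly.
    LC_LOOP3(scale, i,j,k,
             imin[0],imin[1],imin[2],
             imax[0],imax[1],imax[2],
             cctk_lsh[0],cctk_lsh[1],cctk_lsh[2])
    {
      ptrdiff_t const ijk = CCTK_GFINDEX3D(cctkGH, i,j,k);
      u[ijk] = amplitude * u_p[ijk];
    } LC_ENDLOOP3(scale);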
Typically, OpenCL compilers can optimise more aggressively than e.g. C or Fortran compilers. The reason is that an OpenCL compiler sees the complete code: a kernel cannot call routines defined elsewhere, nor can it be influenced by changes originating elsewhere (e.g. in another thread). Unfortunately, this does not mean that all OpenCL compilers are good at optimising; OpenCL is a fairly young language, and some of the technology is still immature.
A file containing the exact source code passed to the compiler is placed into the output directory with the name KERNELNAME.cl. A log file containing all compiler output, including error messages, is placed into the output directory with the name KERNELNAME.log. Both are indispensable for debugging OpenCL code.
This thorn disassembles the compiled kernels so that they can be examined (which would otherwise be difficult in an environment using dynamic compilation). The disassembled output is placed into the output directory with the name KERNELNAME.s, if disassembling is supported and makes sense. (For example, object files for Nvidia GPUs contain PTX, which is essentially high-level assembler code, and are thus not disassembled.)
By default, kernels are disassembled in the background.
This thorn allocates storage for grid functions on the device, and handles copying data to and from the device (in interaction with thorn Accelerator).
OpenCL devices have memory that is independent of the host memory. This is the case even when using CPUs – a particular memory region cannot be accessed by host code and by OpenCL kernels at the same time. This thorn offers several memory models (memory allocation strategies); a sketch of the corresponding OpenCL buffer calls follows the list:
always-mapped:
Host and device access the memory simultaneously. This may work, but violates the OpenCL standard. Do not use this unless you know that your OpenCL implementation supports it; if it does not, some values in memory will change seemingly at random.
copy:
Host and device memory are allocated independently. Data will be copied. This makes sense e.g. for GPUs that have their own memory. This model also allows memory layout optimisation such as aligning grid functions with vector sizes or cache lines. Such layout optimisations are currently not supported by the Cactus flesh (but work is in progress to implement this there).
map:
Device memory is allocated such that it (likely) coincides with the memory already allocated on the host. However, either only the host or only the device can access this memory at a time; the OpenCL run-time needs to be notified to switch between these. This memory model will save space, but may be slower if host memory cannot efficiently be accessed from the device. This memory model is also not yet fully tested.
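The sketch below illustrates, in terms of the standard OpenCL buffer API, roughly what the copy and map models correspond to. It is an illustration only, not the thorn's actual implementation.

    /* Illustration only: OpenCL buffer creation corresponding roughly to
       the "copy" and "map" memory models. */
    #include <stddef.h>
    #include <CL/cl.h>

    static cl_mem create_gf_buffer(cl_context context, size_t nbytes,
                                   void *host_ptr, int use_copy_model,
                                   cl_int *err)
    {
      if (use_copy_model) {
        /* "copy": separate device-side allocation; data are moved
           explicitly with clEnqueueWriteBuffer / clEnqueueReadBuffer */
        return clCreateBuffer(context, CL_MEM_READ_WRITE, nbytes, NULL, err);
      }
      /* "map": buffer backed by the existing host allocation; access is
         handed back and forth with clEnqueueMapBuffer /
         clEnqueueUnmapMemObject */
      return clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                            nbytes, host_ptr, err);
    }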
Routines may execute either on the host (regular routines) or on a device (OpenCL routines). Variables accessed (read or written) by routines may need to be copied between host and device. Thorn Accelerator keeps track of this, and notifies thorn OpenCLRunTime when data need to be copied.
Data also need to be available on the host for inter-processor synchronisation and I/O.
The parameter sync_copy_whole_buffer determines whether the whole grid function or only values on/near the boundary are copied for synchronisation.
This thorn offers a set of convenience macros similar to CCTK_ARGUMENTS and CCTK_PARAMETERS to access grid structure information. Currently, only a subset of the information in cctkGH is available:
    ptrdiff_t cctk_lbnd[]
    ptrdiff_t cctk_lsh[]
    ptrdiff_t imin[]
    ptrdiff_t imax[]
    CCTK_REAL cctk_time
    CCTK_REAL cctk_delta_time
    CCTK_REAL cctk_origin_space[]
    CCTK_REAL cctk_delta_space[]
    CCTK_DELTA_TIME
    CCTK_ORIGIN_SPACE()
    CCTK_DELTA_SPACE()
    CCTK_GFINDEX3D()
cctk_lbnd and cctk_lsh have the same meaning as on the host. imin and imax contain the values specified when calling OpenCLRunTime_CallKernel, and determine the loop bounds used in this kernel. The real-valued variables and their macro counterparts have the same meaning as on the host. The type of the integer fields has been changed from int to ptrdiff_t, which is a 64-bit type on 64-bit platforms, and leads to more efficient code since it avoids type conversions.
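For example, the physical coordinates of a grid point can be computed inside a kernel just as in host code (a sketch, assuming these macros behave exactly as their host counterparts):

    // Physical coordinates of the local grid point (i,j,k)
    CCTK_REAL const x =
      CCTK_ORIGIN_SPACE(0) + (cctk_lbnd[0] + i) * CCTK_DELTA_SPACE(0);
    CCTK_REAL const y =
      CCTK_ORIGIN_SPACE(1) + (cctk_lbnd[1] + j) * CCTK_DELTA_SPACE(1);
    CCTK_REAL const z =
      CCTK_ORIGIN_SPACE(2) + (cctk_lbnd[2] + k) * CCTK_DELTA_SPACE(2);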
This thorn offers looping macros similar to those provided by thorn LoopControl, which parallelise loops via OpenCL’s multithreading.
The loop macros LC_LOOP3 and LC_ENDLOOP3 should be called as in the example above. The first argument defines a name for the loop, the next three arguments define the names of the iteration indices. The remaining arguments describe the loop bounds and the grid function size.
These macros must be used: each OpenCL thread loops only over a part of the region described by imin and imax. Without them, OpenCL's multithreading may be used in an inconsistent manner (unless you distribute the workload yourself via OpenCL's API).
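Conceptually, the macros behave roughly like the strided loops below, built on OpenCL's built-in work-item functions. This is a mental model only, not the actual macro expansion.

    // Conceptual model of LC_LOOP3 (not the real expansion): each OpenCL
    // work-item starts at its own offset and strides by the global size,
    // so all work-items together cover imin..imax exactly once.
    for (ptrdiff_t k = imin[2] + get_global_id(2); k < imax[2];
         k += get_global_size(2))
      for (ptrdiff_t j = imin[1] + get_global_id(1); j < imax[1];
           j += get_global_size(1))
        for (ptrdiff_t i = imin[0] + get_global_id(0); i < imax[0];
             i += get_global_size(0))
        {
          /* loop body */
        }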
OpenCL supports vector data types. Using such vector data types is important to achieve good performance on CPUs. This thorn provides macros, in particular CCTK_REAL_VEC, that can be used for this. Unfortunately, vectorisation has to be performed explicitly by the kernel writer, and is not performed by this thorn. (However, note that some OpenCL compilers can vectorise code automatically.)
When vectorising code explicitly, one needs to use special instructions to load and store values from and to memory. This is not (yet) described here; however, the macros are similar to those offered by thorn Vectors. At the moment, these vectorisation capabilities are targeted for automated code generation (e.g. by Kranc) rather than for manual programming.
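For orientation only, the fragment below shows what vectorised access looks like in plain OpenCL C, using the built-in vector type double4 and the vload4/vstore4 functions; CCTK_REAL_VEC and the thorn's load/store macros wrap this kind of code portably. The thorn's actual macro names are not reproduced here.

    // Plain OpenCL C sketch of vectorised processing of 4 adjacent grid
    // points; assumes the loop advances i in steps of 4 and that ijk is
    // suitably aligned.  CCTK_REAL_VEC abstracts over such vector types.
    double4 const uvec   = vload4(0, &u_p[ijk]);  // load u_p[ijk..ijk+3]
    double4 const scaled = 2.0 * uvec;            // componentwise arithmetic
    vstore4(scaled, 0, &u[ijk]);                  // store to u[ijk..ijk+3]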
disassemble_in_background (private, BOOLEAN)
    Disassemble in the background (using fork)
    Default: yes

disassemble_kernels (private, BOOLEAN)
    Disassemble kernels
    Default: yes

group_size_x (private, INT)
    Group size in x direction
    Range: 1:*    Default: 1

group_size_y (private, INT)
    Group size in y direction
    Range: 1:*    Default: 1

group_size_z (private, INT)
    Group size in z direction
    Range: 1:*    Default: 1

memory_model (private, KEYWORD)
    Memory model
    Range:
        always-mapped : CPU and GPU use the same memory (may violate standard)
        copy          : Copy buffers
        map           : Map buffers (requires same layout)
    Default: copy

opencl_device_type (private, KEYWORD)
    Device type
    Range: CPU, GPU, accelerator, any
    Default: CPU

opencl_options (private, STRING)
    OpenCL compiler options
    Default: -cl-mad-enable -cl-no-signed-zeros -cl-fast-relaxed-math

sync_copy_whole_buffer (private, BOOLEAN)
    Copy whole buffer before/after syncing
    Default: no

tile_size_x (private, INT)
    Tile size in x direction
    Range: 1:*    Default: 1

tile_size_y (private, INT)
    Tile size in y direction
    Range: 1:*    Default: 1

tile_size_z (private, INT)
    Tile size in z direction
    Range: 1:*    Default: 1

unroll_size_x (private, INT)
    Unroll size in x direction
    Range: 1:*    Default: 1

unroll_size_y (private, INT)
    Unroll size in y direction
    Range: 1:*    Default: 1

unroll_size_z (private, INT)
    Unroll size in z direction
    Range: 1:*    Default: 1

vector_size_x (private, INT)
    Vector size in x direction
    Range:
        use preferred vector size
        1:*
    Default: (none)

vector_size_y (private, INT)
    Vector size in y direction
    Range: 1    Default: 1

vector_size_z (private, INT)
    Vector size in z direction
    Range: 1    Default: 1

verbose (private, BOOLEAN)
    Output detailed device information
    Default: no

veryverbose (private, BOOLEAN)
    Output even more detailed information
    Default: no

out_dir (shared from IO, STRING)
Implements: openclruntime

Adds header: OpenCLRunTime.h

Uses header: carpet.hh, vectors.h

Provides:
    Device_CreateVariables
    Device_CopyCycle
    Device_CopyFromPast
    Device_CopyToDevice
    Device_CopyToHost
    Device_CopyPreSync
    Device_CopyPostSync
    LinearCombination
This section lists all the variables which are assigned storage by thorn CactusUtils/OpenCLRunTime. Storage can either last for the duration of the run (Always means that if this thorn is activated storage will be assigned, Conditional means that if this thorn is activated storage will be assigned for the duration of the run if some condition is met), or can be turned on for the duration of a schedule function.
NONE
CCTK_STARTUP
    openclruntime_setup: "set up opencl device"
        Language: C
        Type:     function

CCTK_WRAGH
    openclruntime_deviceinfo: "output opencl device information"
        Language: C
        Type:     function

CCTK_WRAGH
    openclruntime_autoconf: "determine whether certain features are supported"
        After:    openclruntime_deviceinfo
        Language: C
        Type:     function

CCTK_BASEGRID
    openclruntime_setupdevicegh: "set up device grid structure"
        After:    spatialspacings, temporalspacings
        Before:   spatialcoordinates
        Language: C
        Type:     function

CCTK_TERMINATE
    openclruntime_statistics: "output profiling information"
        Language: C
        Options:  global
        Type:     function