Erik Schnetter

May 17, 2012


Many accelerators and GPU architectures separate device memory from host memory. While device memory may be directly accessible from the host and vice versa, this may include a significant performance penalty. Such architectures require transferring data between host and device, depending on which routines execute where, and depending on which data these routines read and write.

Thorn Accelerator keeps track of which data are valid where (host or device), and schedules the necessary data transfers.

1 Overview

This thorn keeps track of which grid variables are valid where (host or device). Grid variables can be valid either nowhere (e.g. in the beginning), only on the host or only on the device (e.g. if the variable was set by a routine executing there), or both on host and device (e.g. if the variable was set on the device and then copied to the host).
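The four validity states described above can be modelled as a pair of bit flags. The following is a minimal sketch for illustration only; the class name Valid and its members are hypothetical and are not taken from the thorn's source code.

```python
from enum import Flag


class Valid(Flag):
    """Where a grid variable currently holds valid data (hypothetical names)."""
    NOWHERE = 0
    HOST = 1
    DEVICE = 2
    EVERYWHERE = HOST | DEVICE


# Hypothetical life cycle of one grid variable:
state = Valid.NOWHERE      # freshly allocated, never set
state = Valid.DEVICE       # set by a routine that executed on the device
state |= Valid.HOST        # copied from the device to the host
```

Copying data adds a location to the state, whereas writing data (see the bookkeeping step below) replaces the state with the single location where the write happened.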

This thorn also examines the schedule to determine which data are required where, and initiates the necessary copy operations.

2 Details

2.1 Scheduled Routines

2.1.1 Preparation

Just before a scheduled routine executes, this thorn examines the schedule to determine where the routine executes: a routine tagged Device=1 executes on a device, while a routine without this tag or tagged Device=0 executes on the host.

This thorn then examines the READS declarations for this routine in the schedule.ccl file (see the users’ guide). These declarations specify which grid variables will be read by this routine.

Accelerator then ensures that these grid variables have valid data where the routine executes (host or device). If necessary, data are copied. Accelerator does not perform the necessary copy operations itself; these are performed by lower-level, architecture-aware thorns such as OpenCLRunTime.
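The preparation step can be sketched as follows. This is an illustration under stated assumptions, not the thorn's implementation: the function ensure_valid, the dictionary layout, and the copy callback (which stands in for a thorn such as OpenCLRunTime) are all hypothetical, and raising an error for data that are valid nowhere is a choice made for this sketch only.

```python
def ensure_valid(validity, reads, where, copy):
    """Make every variable in `reads` valid at `where` ("host" or "device").

    validity maps a variable name to the set of places holding valid data.
    copy(var, source, target) stands in for the device-specific transfer.
    """
    source = "device" if where == "host" else "host"
    for var in reads:
        places = validity.setdefault(var, set())
        if where in places:
            continue                      # already valid here, nothing to do
        if source not in places:
            # Assumption of this sketch: reading never-set data is an error.
            raise RuntimeError(f"{var} is valid nowhere")
        copy(var, source, where)
        places.add(where)                 # now valid in both places


# Usage: "rhs" was last written on the host; the next routine runs on the device.
validity = {"phi": {"device"}, "rhs": {"host"}}
transfers = []
ensure_valid(validity, ["phi", "rhs"], "device",
             lambda var, src, dst: transfers.append((var, src, dst)))
```

Only "rhs" is transferred here, since "phi" is already valid on the device.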

The parameter only_reads_current_timelevel can be used to indicate that all routines only ever read the current timelevel. This may be true, e.g., for thorns using MoL for time integration.

2.1.2 Bookkeeping

Just after a scheduled routine has finished, this thorn examines the WRITES declarations for that routine, and marks these grid variables as valid on the device and invalid on the host if the routine executed on the device, and vice versa.
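In terms of a simple validity table, the bookkeeping step amounts to overwriting the entry of each written variable. The sketch below assumes a hypothetical dictionary that maps variable names to the set of places holding valid data; the name mark_written is likewise an assumption.

```python
def mark_written(validity, writes, where):
    """Record that `writes` were just set at `where` ("host" or "device")."""
    for var in writes:
        validity[var] = {where}   # valid here, stale everywhere else


validity = {"phi": {"host", "device"}, "rhs": set()}
mark_written(validity, ["rhs"], "device")   # routine tagged Device=1
mark_written(validity, ["phi"], "host")     # routine executing on the host
```

After these two calls, "rhs" is valid only on the device, and the device copy of "phi" is considered stale.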

The parameter only_writes_current_timelevel can be used to indicate that all routines only ever write the current timelevel. This may be true, e.g., for thorns using MoL for time integration. However, it may not be true for routines setting up initial data, if these routines set up multiple timelevels.

2.2 Synchronisation

Synchronisation is always performed on the host.

Before synchronising, this thorn ensures that all data that are sent to neighbouring processors during synchronisation are valid on the host, by copying these to the host if necessary. Only the data actually necessary for synchronisation are copied.

After synchronising, this thorn copies ghost zone values from the host to the device, for those grid variables for which the device already has otherwise valid data.
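The two steps around a synchronisation might look like the sketch below. All names are hypothetical, and the two callbacks stand in for partial transfers provided by a device-specific thorn. As a simplification, the validity table is left unchanged, because only part of each variable is transferred.

```python
def pre_sync(validity, sync_vars, copy_boundaries_to_host):
    """Fetch what the host needs for variables that are not valid there."""
    for var in sync_vars:
        if "host" not in validity[var]:
            copy_boundaries_to_host(var)


def post_sync(validity, sync_vars, copy_ghosts_to_device):
    """Refresh device ghost zones for variables the device already holds."""
    for var in sync_vars:
        if "device" in validity[var]:
            copy_ghosts_to_device(var)


validity = {"phi": {"device"}, "psi": {"host"}, "chi": {"host", "device"}}
log = []
pre_sync(validity, ["phi", "psi", "chi"], lambda v: log.append(("to_host", v)))
# ... the driver synchronises on the host here ...
post_sync(validity, ["phi", "psi", "chi"], lambda v: log.append(("to_device", v)))
```

In this example "psi" causes no traffic at all: it is already valid on the host, and the device holds no valid copy whose ghost zones would need refreshing.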

2.3 I/O

I/O is always performed on the host.

Before performing I/O, this thorn copies data from the device to the host. Since I/O routines do not (yet?) declare which variables they access, this thorn has to copy back all variables to the host.

This may be expensive, and the parameters copy_back_every, copy_back_vars, and copy_back_all_timelevels can be used to copy only certain data to the host. It is up to the user to ensure that these settings are consistent with the I/O settings that select which variables are output when.
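To illustrate why these settings must match the I/O settings, the sketch below filters the variables to copy back by iteration and by name. The semantics shown are assumptions made for illustration (copy_back_every as an iteration interval, copy_back_vars as a set of variable names); the authoritative definitions are in the thorn's param.ccl file, and copy_back_all_timelevels is not modelled here.

```python
def select_copy_back(iteration, variables, copy_back_every, copy_back_vars):
    """Return the variables to copy to the host before I/O at `iteration`.

    Assumed semantics, for illustration only: copy back every
    `copy_back_every` iterations, and only the variables named in
    `copy_back_vars`.
    """
    if copy_back_every <= 0 or iteration % copy_back_every != 0:
        return []
    return [v for v in variables if v in copy_back_vars]


variables = ["phi", "psi", "rhs"]
at_8 = select_copy_back(8, variables, copy_back_every=4, copy_back_vars={"phi"})
at_9 = select_copy_back(9, variables, copy_back_every=4, copy_back_vars={"phi"})
```

With these settings, an I/O method that outputs "psi", or that outputs "phi" at iteration 9, would write host data that were not refreshed from the device.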

2.4 Timelevel Cycling

When the driver cycles time levels on the host, this thorn cycles time levels on the device as well, and then marks the new current timelevel as invalid on both host and device.
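The effect of cycling on the validity information can be sketched as a rotation of a per-timelevel list. The list layout (index 0 is the current timelevel) and the function name are assumptions made for this example.

```python
def cycle_timelevels(levels):
    """Rotate validity information; levels[0] is the current timelevel.

    Each entry is the set of places ("host", "device") holding valid data.
    """
    # The old current level becomes the first past level, and so on; the
    # oldest level is dropped. The new current timelevel has not been
    # computed yet, so it is valid nowhere.
    return [set()] + levels[:-1]


levels = [{"device"}, {"host", "device"}, {"host"}]
levels = cycle_timelevels(levels)
```

After cycling, the data that were current and valid on the device are found on the first past timelevel, still valid on the device only.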

3 Implementation Details

Accelerator expects to be called by the driver on certain occasions. It provides a set of aliased functions for this that the driver should call. Other infrastructure thorns may also need to call these routines.

Cycle timelevels of all grid variables
Copy the past timelevel of some grid variables to the current timelevel (used by MoL at the beginning of each integration step)
Must be called just before synchronising
Must be called just after synchronising
Must be called just before calling a scheduled routine
Must be called just after calling a scheduled routine
Tell Accelerator that certain grid functions were modified, and are thus valid somewhere and invalid elsewhere
Tell Accelerator that certain grid functions will be accessed write-only somewhere
Ask Accelerator to ensure that certain grid functions are valid somewhere (by copying data if necessary)
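The ordering constraints in the list above ("just before" and "just after") can be illustrated with a sketch of how a driver might wrap one scheduled routine and its synchronisation. The method names used here (pre_call, post_call, pre_sync, post_sync) are hypothetical placeholders and are not the names of the aliased functions; the class only records the order of calls.

```python
class TracingAccelerator:
    """Stand-in that only records the order in which hooks are invoked."""

    def __init__(self):
        self.trace = []

    def pre_call(self, routine):
        self.trace.append(("pre_call", routine))

    def post_call(self, routine):
        self.trace.append(("post_call", routine))

    def pre_sync(self, groups):
        self.trace.append(("pre_sync", tuple(groups)))

    def post_sync(self, groups):
        self.trace.append(("post_sync", tuple(groups)))


def run_scheduled(acc, routine, sync_groups):
    """What a driver might do for one schedule item that also synchronises."""
    acc.pre_call(routine.__name__)
    routine()
    acc.post_call(routine.__name__)
    if sync_groups:
        acc.pre_sync(sync_groups)
        # ... host-side ghost zone exchange happens here ...
        acc.post_sync(sync_groups)


def calc_rhs():
    pass


acc = TracingAccelerator()
run_scheduled(acc, calc_rhs, ["wave::scalar"])
```

The group name "wave::scalar" and the routine calc_rhs are made up for this example.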

Accelerator expects another thorn to provide the actual device-specific copy operations, e.g. for an OpenCL or CUDA device. These functions must be provided:

Create (i.e. begin to track) a set of variables/timelevels. This routine will be called exactly once before a variable/timelevel combination is mentioned to the device.
Cycle all timelevels on the device. (This routine should not block.)
Copy past to current time level on the device. (This routine should not block.)