Accelerator

Erik Schnetter <eschnetter@perimeterinstitute.ca>

May 17, 2012

Abstract

Many accelerators and GPU architectures separate device memory from host memory. While device memory may be directly accessible from the host and vice versa, such access may incur a significant performance penalty. Such architectures require transferring data between host and device, depending on which routines execute where and on which data these routines read and write.

Thorn Accelerator keeps track of which data are valid where (host or device), and schedules the necessary data transfers.

1 Overview

This thorn keeps track of which grid variables are valid where (host or device). Grid variables can be valid either nowhere (e.g. in the beginning), only on the host or only on the device (e.g. if the variable was set by a routine executing there), or both on host and device (e.g. if the variable was set on the device and then copied to the host).
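The four validity states described above can be modelled with a small flag type. This is only an illustrative sketch in Python, not the thorn's actual data structure; the name Valid and its members are hypothetical.

```python
from enum import Flag


class Valid(Flag):
    """Where a grid variable holds valid data (hypothetical model)."""
    NOWHERE = 0
    HOST = 1
    DEVICE = 2
    EVERYWHERE = HOST | DEVICE


# In the beginning a variable is valid nowhere.
state = Valid.NOWHERE
# A routine executing on the device sets it: valid only on the device.
state = Valid.DEVICE
# Copying it to the host makes it valid in both places.
state |= Valid.HOST
print(state == Valid.EVERYWHERE)
```

Using flags rather than four unrelated constants keeps "valid on the host" a single membership test, whichever of the states the variable is in.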

This thorn also examines the schedule to determine which data are required where, and initiates the necessary copy operations.

2 Details

2.1 Scheduled Routines

2.1.1 Preparation

Just before a scheduled routine executes, this thorn examines the schedule to determine where the routine executes: a routine tagged Device=1 executes on a device, while a routine without this tag or tagged Device=0 executes on the host.

This thorn then examines the READS declarations for this routine in the schedule.ccl file (see the users’ guide). These declarations specify which grid variables will be read by this routine.

Accelerator then ensures that these grid variables have valid data where the routine executes (host or device). If necessary, data are copied. Accelerator does not perform the necessary copy operations itself; these are performed by lower-level, accelerator-architecture-aware thorns such as OpenCLRunTime.

The parameter only_reads_current_timelevel can be used to indicate that all routines only ever read the current timelevel. This may e.g. be true for thorns using MoL for time integration.
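The preparation step can be sketched as follows. This is a simplified Python model under stated assumptions, not the thorn's code: the function prepare, the dictionary layout, and the copy callback are all hypothetical, and the sketch assumes that a variable valid nowhere is simply skipped.

```python
def prepare(where, reads, valid, copy, only_reads_current_timelevel=False):
    """Make every variable in `reads` valid at `where` ("host" or "device").

    valid maps (variable, timelevel) -> set of locations with valid data;
    timelevel 0 is the current one. copy(var, tl, src, dst) stands in for
    the lower-level transfer routine (hypothetical signature).
    """
    for (var, tl), locations in valid.items():
        if var not in reads:
            continue
        if only_reads_current_timelevel and tl != 0:
            continue  # past timelevels are assumed not to be read
        if where in locations or not locations:
            continue  # already valid here, or nothing valid to copy from
        src = next(iter(locations))
        copy(var, tl, src, where)
        locations.add(where)  # plain copy: data are now valid in both places


valid = {("phi", 0): {"device"}, ("phi", 1): {"device"}}
prepare("host", ["phi"], valid, lambda *args: print("copy", args),
        only_reads_current_timelevel=True)
```

With only_reads_current_timelevel set, the past timelevel of phi stays on the device and no transfer is issued for it.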

2.1.2 Bookkeeping

Just after a scheduled routine has finished, this thorn examines the WRITES declarations for that routine, and marks these grid variables as valid on the device and invalid on the host if the routine executed on the device, and vice versa.

The parameter only_writes_current_timelevel can be used to indicate that all routines only ever write the current timelevel. This may e.g. be true for thorns using MoL for time integration. However, this may not be true for routines setting up initial data, if these routines set up multiple timelevels.
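The bookkeeping rule amounts to replacing a variable's set of valid locations with the single location where the routine ran. A minimal Python sketch, with the hypothetical names record_writes and valid:

```python
def record_writes(where, writes, valid):
    """Mark each written variable valid at `where` and invalid elsewhere.

    valid maps variable name -> set of locations ("host", "device")
    holding valid data (hypothetical layout).
    """
    for var in writes:
        valid[var] = {where}


valid = {"phi": {"host", "device"}, "rhs": set()}
record_writes("device", ["rhs"], valid)
print(valid)
```

Variables that are not listed in the WRITES declarations, such as phi here, keep their previous state.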

2.2 Synchronisation

Synchronisation is always performed on the host.

Before synchronising, this thorn ensures that all data that are sent to neighbouring processors during synchronisation are valid on the host, by copying these to the host if necessary. Only the data actually necessary for synchronisation are copied.

After synchronising, this thorn copies ghost zone values from the host to the device, for those grid variables for which the device already has otherwise valid data.
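The two synchronisation steps can be sketched in Python as below. The function names, the dictionary layout, and the two callbacks are hypothetical. The sketch also assumes that the validity record is left unchanged, since only part of each variable is transferred.

```python
def pre_sync(sync_vars, valid, copy_sync_region_to_host):
    """Before synchronising: bring the data to be sent onto the host."""
    for var in sync_vars:
        if "host" not in valid[var] and "device" in valid[var]:
            # Only the region needed for the exchange, not the whole variable.
            copy_sync_region_to_host(var)


def post_sync(sync_vars, valid, copy_ghosts_to_device):
    """After synchronising: refresh ghost zones where the device has data."""
    for var in sync_vars:
        if "device" in valid[var]:
            copy_ghosts_to_device(var)


valid = {"phi": {"device"}, "psi": {"host"}}
pre_sync(["phi", "psi"], valid, lambda v: print("to host:", v))
post_sync(["phi", "psi"], valid, lambda v: print("ghosts to device:", v))
```

Here psi lives only on the host, so neither step transfers anything for it.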

2.3 I/O

I/O is always performed on the host.

Before performing I/O, this thorn copies data from the device to the host. Since I/O routines do not (yet?) declare which variables they access, this thorn has to copy back all variables to the host.

This may be expensive, and the parameters copy_back_every, copy_back_vars, and copy_back_all_timelevels can be used to copy only certain data to the host. It is up to the user to ensure that these settings are consistent with the I/O settings that select which variables are output when.
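One possible reading of these parameters is sketched below in Python. The function names are hypothetical, and several details are assumptions rather than documented behaviour: that copying happens on iterations divisible by copy_back_every, that values below 1 mean "never", and that copy_back_vars is a whitespace-separated list. Group names are not modelled.

```python
def copy_back_due(iteration, copy_back_every):
    """True if data are copied back on this iteration (assumed rule)."""
    if copy_back_every < 1:
        return False  # assumption: values below 1 mean "never"
    return iteration % copy_back_every == 0


def select_copy_back_vars(all_vars, copy_back_vars):
    """Pick the variables named in copy_back_vars ('all' selects everything)."""
    names = copy_back_vars.split()  # an empty string yields no names
    if names == ["all"]:
        return list(all_vars)
    return [v for v in all_vars if v in names]


for it in range(1, 5):
    if copy_back_due(it, copy_back_every=2):
        print(it, select_copy_back_vars(["phi", "psi"], "phi"))
```

If output is requested for a variable or iteration that these settings exclude, the host copy may be stale, which is why the settings have to match the I/O settings.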

2.4 Timelevel Cycling

When the driver cycles time levels on the host, this thorn cycles time levels on the device as well, and then marks the new current timelevel as invalid on both host and device.
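In terms of the validity record, cycling shifts each timelevel's state one step into the past and leaves the new current timelevel valid nowhere. A minimal Python sketch, assuming a hypothetical list layout in which index 0 is the current timelevel:

```python
def cycle_timelevels(valid):
    """Shift validity states one level into the past.

    valid[0] is the current timelevel, valid[1] the previous one, and so on.
    The new current timelevel is valid neither on host nor on device.
    """
    return [set()] + valid[:-1]


before = [{"device"}, {"host", "device"}, {"host"}]
print(cycle_timelevels(before))
```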

3 Implementation Details

Accelerator expects to be called by the driver on certain occasions. It provides a set of aliased functions for this that the driver should call. Other infrastructure thorns may also need to call these routines.

Accelerator_Cycle
Cycle timelevels of all grid variables
Accelerator_CopyFromPast
Copy the past timelevel of some grid variables to the current timelevel (used by MoL at the beginning of each integration step)
Accelerator_PreSync
Must be called just before synchronising
Accelerator_PostSync
Must be called just after synchronising
Accelerator_PreCallFunction
Must be called just before calling a scheduled routine
Accelerator_PostCallFunction
Must be called just after calling a scheduled routine
Accelerator_NotifyDataModified
Tell Accelerator that certain grid functions were modified, and are thus valid somewhere and invalid elsewhere
Accelerator_RequireInvalidData
Tell Accelerator that certain grid functions will be accessed write-only somewhere
Accelerator_RequireValidData
Ask Accelerator to ensure that certain grid functions are valid somewhere (by copying data if necessary)
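The ordering constraints in this list ("just before" and "just after") can be illustrated with a small Python sketch of a driver. The callbacks only stand in for the aliased functions; their real signatures are not given here, so the ones below are hypothetical.

```python
def run_scheduled(routine, pre_call, post_call):
    """Bracket one scheduled routine with the pre/post hooks."""
    pre_call(routine)   # stands in for Accelerator_PreCallFunction
    routine()
    post_call(routine)  # stands in for Accelerator_PostCallFunction


def synchronise(groups, pre_sync, exchange, post_sync):
    """Bracket the driver's ghost-zone exchange with the sync hooks."""
    pre_sync(groups)    # stands in for Accelerator_PreSync
    exchange(groups)    # the driver's own exchange, performed on the host
    post_sync(groups)   # stands in for Accelerator_PostSync


trace = []


def evolve():
    trace.append("evolve")


run_scheduled(evolve,
              lambda r: trace.append("pre_call"),
              lambda r: trace.append("post_call"))
synchronise(["phi"],
            lambda g: trace.append("pre_sync"),
            lambda g: trace.append("exchange"),
            lambda g: trace.append("post_sync"))
print(trace)
```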

Accelerator expects another thorn to provide the actual device-specific copy operations, e.g. for an OpenCL or CUDA device. These functions must be provided:

Device_CreateVariables
Create (i.e. begin to track) a set of variables/timelevels. This routine will be called exactly once before a variable/timelevel combination is mentioned to the device.
Device_CopyCycle
Cycle all timelevels on the device. (This routine should not block.)
Device_CopyFromPast
Copy past to current time level on the device. (This routine should not block.)
Device_CopyToDevice
Copy data from the host to the device. The return argument indicates whether the data have been copied or moved (i.e. are now invalid on the host). (This routine should not block.)
Device_CopyToHost
Copy data from the device back to the host. The return argument indicates whether the data have been copied or moved. (This routine should block until all data have been copied.)
Device_CopyPreSync
Copy those data from the device back to the host that will be needed for synchronisation (i.e. for inter-process synchronisation; AMR is not yet supported). (This routine should block until all data have been copied.)
Device_CopyPostSync
Copy the ghost zones to the device. (This routine should not block.)

The note “should not block” indicates that the respective data transfer operation should be performed in the background if possible; it is not necessary to wait until the data transfer has finished before returning. However, “should block” indicates that the routine must wait until all data have been transferred.
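The difference between the two notes can be illustrated with a toy device in Python, where a single background worker thread stands in for the transfer engine. Everything here is hypothetical, including the class ToyDevice and the encoding of the return value (False for "copied", True for "moved"). It is a sketch of the blocking semantics only, not of any real device back end.

```python
from concurrent.futures import ThreadPoolExecutor, wait


class ToyDevice:
    """Toy stand-in for a device back end (hypothetical)."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)  # one transfer queue
        self._pending = []
        self.device = {}
        self.host = {}

    def copy_to_device(self, name, data):
        """'Should not block': queue the transfer and return at once."""
        future = self._pool.submit(self.device.__setitem__, name, list(data))
        self._pending.append(future)
        return False  # hypothetical encoding: False = copied, True = moved

    def copy_to_host(self, name):
        """'Should block': wait for queued transfers, then copy back."""
        wait(self._pending)
        self._pending.clear()
        self.host[name] = list(self.device[name])
        return False

    def close(self):
        self._pool.shutdown(wait=True)


dev = ToyDevice()
dev.copy_to_device("phi", [1.0, 2.0, 3.0])  # returns immediately
dev.copy_to_host("phi")                     # returns only when data arrived
print(dev.host["phi"])
dev.close()
```

A routine that reports "moved" instead of "copied" would require the caller to mark the source side as invalid afterwards.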

4 Restrictions

At the moment, Accelerator only supports unigrid simulations; adaptive mesh refinement or multi-block methods are not yet supported. The main reason for this is that it does not keep track of the additional metadata – this would be straightforward to add.

Accelerator is currently also tied to using Carpet as driver and does not work with e.g. PUGH. The main reason for this is that PUGH and the flesh do not provide the hooks necessary for Accelerator to work – these would also be straightforward to add.

5 Parameters




copy_back_all_timelevels
Scope: private  BOOLEAN
Description: Copy all timelevels back to the host for output
Default: no

copy_back_all_written_variables_in_analysis
Scope: private  BOOLEAN
Description: Copy all variables that are written to in analysis back to the host
Default: yes

copy_back_every
Scope: private  INT
Description: When to copy variables back to the host
Default: 1
Range:
  never
  1:*  every so many iterations

copy_back_vars
Scope: private  STRING
Description: Which variables to copy back
Default: all
Range:
  .*  list of group or variable names; empty to copy nothing, ’all’ to copy all variables

verbose
Scope: private  BOOLEAN
Description: Output detailed information
Default: no

veryverbose
Scope: private  BOOLEAN
Description: Output even more detailed information
Default: no

only_reads_current_timelevel
Scope: restricted  BOOLEAN
Description: Assume that functions read only the current timelevel
Default: no

only_writes_current_timelevel
Scope: restricted  BOOLEAN
Description: Assume that functions write only the current timelevel
Default: no



6 Interfaces

General

Implements:

accelerator

Uses header:

carpet.hh

Provides:

Accelerator_Cycle
Accelerator_CopyFromPast
Accelerator_PreSync
Accelerator_PostSync
Accelerator_PreCallFunction
Accelerator_PostCallFunction
Accelerator_NotifyDataModified
Accelerator_RequireInvalidData
Accelerator_RequireValidData

7 Schedule

This section lists all the variables which are assigned storage by thorn CactusUtils/Accelerator. Storage can either last for the duration of the run (Always means that if this thorn is activated storage will be assigned, Conditional means that if this thorn is activated storage will be assigned for the duration of the run if some condition is met), or can be turned on for the duration of a schedule function.

Storage

NONE

Scheduled Functions

CCTK_STARTUP
  accelerator_init
  initialise accelerator thorn
  Language: c
  Type: function

CCTK_ANALYSIS
  accelerator_copyback
  copy memory buffers back to host memory
  Language: c
  Type: function