Accelerator

Erik Schnetter <eschnetter@perimeterinstitute.ca>

May 17, 2012

Abstract

Many accelerators and GPU architectures separate device memory from host memory. While device memory may be directly accessible from the host and vice versa, this may incur a significant performance penalty. Such architectures require transferring data between host and device, depending on which routines execute where, and depending on which data these routines read and write.

Thorn Accelerator keeps track of which data are valid where (host or device), and schedules the necessary data transfers.

1 Overview

This thorn keeps track of which grid variables are valid where (host or device). Grid variables can be valid either nowhere (e.g. in the beginning), only on the host or only on the device (e.g. if the variable was set by a routine executing there), or both on host and device (e.g. if the variable was set on the device and then copied to the host).
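The four validity states described above can be modelled as a pair of flags. The following is only an illustrative sketch in Python, not the thorn's actual data structure; the names Valid and state are hypothetical.

```python
from enum import Flag


class Valid(Flag):
    """Where a grid variable currently holds valid data (hypothetical model)."""
    NOWHERE = 0
    HOST = 1
    DEVICE = 2
    EVERYWHERE = HOST | DEVICE


# Hypothetical bookkeeping table: variable name -> validity state.
state = {"phi": Valid.NOWHERE}   # in the beginning, nothing is valid

state["phi"] = Valid.DEVICE      # set by a routine executing on the device
state["phi"] |= Valid.HOST       # then copied to the host: valid on both

print(state["phi"] == Valid.EVERYWHERE)
```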

This thorn also examines the schedule to determine which data are required where, and initiates the necessary copy operations.

2 Details

2.1 Scheduled Routines

2.1.1 Preparation

Just before a scheduled routine executes, this thorn examines the schedule to determine where the routine executes: a routine tagged Device=1 executes on a device, a routine without this tag or tagged Device=0 executes on the host.

This thorn then examines the READS declarations for this routine in the schedule.ccl file (see users’ guide). These declarations specify which grid variables will be read by this routine.

Accelerator then ensures that these grid variables have valid data where the routine executes (host or device). If necessary, data are copied. Accelerator does not perform the necessary copy operations itself; these are performed by lower-level, accelerator-architecture-aware thorns such as OpenCLRunTime.

The parameter only_reads_current_timelevel can be used to indicate that all routines only ever read the current timelevel. This may e.g. be true for thorns using MoL for time integration.
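The preparation step can be sketched as follows. This is a minimal Python model under stated assumptions, not the thorn's code: the function prepare, the valid table, and the copy callback (standing in for a lower-level thorn such as OpenCLRunTime) are hypothetical, timelevels are ignored, and raising an error for data that are valid nowhere is an assumption.

```python
def prepare(reads, where, valid, copy):
    """Ensure every variable in `reads` is valid at `where` before a routine runs.

    reads: variable names taken from the routine's READS declarations
    where: "host" or "device", derived from the routine's Device tag
    valid: dict mapping variable name -> set of locations holding valid data
    copy:  callback standing in for the lower-level, device-specific thorn
    """
    for var in reads:
        locations = valid.setdefault(var, set())
        if where in locations:
            continue                  # already valid here, nothing to do
        if not locations:
            # Assumption: reading data that are valid nowhere is an error.
            raise RuntimeError(f"{var} is read but valid nowhere")
        copy(var, where)              # delegate the actual transfer
        locations.add(where)          # now valid on both host and device


transfers = []
valid = {"gxx": {"device"}, "alp": {"host", "device"}}
prepare(["gxx", "alp"], "host", valid,
        lambda var, dst: transfers.append((var, dst)))
print(transfers)
```

Only gxx is transferred in this example, since alp is already valid on the host.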

2.1.2 Bookkeeping

Just after a scheduled routine has finished, this thorn examines the WRITES declarations for that routine, and marks these grid variables as valid on the device and invalid on the host if the routine executed on the device, and vice versa.

The parameter only_writes_current_timelevel can be used to indicate that all routines only ever write the current timelevel. This may e.g. be true for thorns using MoL for time integration. However, this may not be true for routines setting up initial data, if these routines set up multiple timelevels.
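The bookkeeping step can be sketched in the same spirit. The Python model below is hypothetical (the names record_writes and valid are not part of the thorn), and it assumes that without only_writes_current_timelevel every timelevel of a written variable has to be treated as modified.

```python
def record_writes(writes, where, valid, ntimelevels, only_current=False):
    """Mark written variables as valid at `where` and invalid elsewhere.

    valid maps (variable name, timelevel) -> set of locations with valid data;
    timelevel 0 is the current timelevel.
    """
    # Assumption: unless told otherwise, all timelevels may have been written.
    levels = range(1) if only_current else range(ntimelevels)
    for var in writes:
        for tl in levels:
            valid[(var, tl)] = {where}


valid = {("phi", 0): {"host"}, ("phi", 1): {"host"}}
record_writes(["phi"], "device", valid, ntimelevels=2, only_current=True)
print(valid)
```

With only_current=True the past timelevel of phi stays valid on the host, so it does not need to be transferred again later.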

2.2 Synchronisation

Synchronisation is always performed on the host.

Before synchronising, this thorn ensures that all data that are sent to neighbouring processors during synchronisation are valid on the host, by copying these to the host if necessary. Only the data actually necessary for synchronisation are copied.

After synchronising, this thorn copies ghost zone values from the host to the device, for those grid variables for which the device already has otherwise valid data.
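The two halves of this procedure can be sketched as below. This is a hypothetical Python model: pre_sync, post_sync and the log of transfers are illustrative names, and it is an assumption that a partial copy of the boundary data does not make the host copy valid as a whole.

```python
def pre_sync(sync_vars, valid, log):
    """Before synchronising: fetch the data that the host is missing."""
    for var in sync_vars:
        if "host" not in valid[var]:
            # Only the zones sent to neighbours are copied, so (by assumption)
            # the host copy as a whole is still not marked valid.
            log.append(("boundary_to_host", var))


def post_sync(sync_vars, valid, log):
    """After synchronising: refresh ghost zones where the device has valid data."""
    for var in sync_vars:
        if "device" in valid[var]:
            log.append(("ghosts_to_device", var))


valid = {"phi": {"device"}, "rho": {"host"}}
log = []
pre_sync(["phi", "rho"], valid, log)
# ... the driver exchanges ghost zones on the host here ...
post_sync(["phi", "rho"], valid, log)
print(log)
```

In this example rho never touches the device: it is already valid on the host before synchronising, and the device holds no valid copy that would need updated ghost zones.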

2.3 I/O

I/O is always performed on the host.

Before performing I/O, this thorn copies data from the device to the host. Since I/O routines do not (yet?) declare which variables they access, this thorn has to copy back all variables to the host.

This may be expensive, and the parameters copy_back_every, copy_back_vars, and copy_back_all_timelevels can be used to copy only certain data to the host. It is up to the user to ensure that these settings are consistent with the I/O settings that select which variables are output when.
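How these parameters might select the data to copy back can be sketched as follows. This Python function is hypothetical and rests on several assumptions: a non-positive copy_back_every means "never", copying happens on iterations that are multiples of copy_back_every, and copy_back_vars is a whitespace-separated list of variable names (group names and copy_back_all_timelevels are not modelled).

```python
def vars_to_copy_back(iteration, all_vars, copy_back_every, copy_back_vars):
    """Return the variables to copy to the host at this iteration."""
    # Assumption: non-positive means never; otherwise every so many iterations.
    if copy_back_every <= 0 or iteration % copy_back_every != 0:
        return []
    if copy_back_vars.strip().lower() == "all":
        return list(all_vars)
    wanted = set(copy_back_vars.split())   # empty string: copy nothing
    return [var for var in all_vars if var in wanted]


all_vars = ["phi", "rho"]
print(vars_to_copy_back(4, all_vars, 2, "all"))
print(vars_to_copy_back(3, all_vars, 2, "all"))
print(vars_to_copy_back(4, all_vars, 2, "rho"))
```

If output is requested for an iteration or variable that this selection skips, the host data may be stale, which is why the settings must match the I/O settings.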

2.4 Timelevel Cycling

When the driver cycles time levels on the host, this thorn cycles time levels on the device as well, and then marks the new current timelevel as invalid on both host and device.
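The effect of cycling on the validity bookkeeping can be sketched as follows. This is a hypothetical Python model for a single grid variable, assuming that index 0 denotes the current timelevel and that cycling shifts each timelevel one step into the past.

```python
def cycle(valid_by_level):
    """Rotate validity information by one timelevel.

    valid_by_level[0] describes the current timelevel, [1] the previous one,
    and so on. The oldest entry is dropped, and the new current timelevel
    starts out valid nowhere.
    """
    return [set()] + valid_by_level[:-1]


before = [{"device"}, {"host", "device"}, {"host"}]
after = cycle(before)
print(after)
```

The new current timelevel is valid nowhere until a routine writes it on the host or on the device.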

3 Implementation Details

Accelerator expects to be called by the driver on certain occasions. It provides a set of aliased functions for this that the driver should call. Other infrastructure thorns may also need to call these routines.

Accelerator_Cycle: Cycle timelevels of all grid variables
Accelerator_CopyFromPast: Copy the past timelevel of some grid variables to the current timelevel (used by MoL at the beginning of each integration step)
Accelerator_PreSync: Must be called just before synchronising
Accelerator_PostSync: Must be called just after synchronising
Accelerator_PreCallFunction: Must be called just before calling a scheduled routine
Accelerator_PostCallFunction: Must be called just after calling a scheduled routine
Accelerator_NotifyDataModified: Tell Accelerator that certain grid functions were modified, and are thus valid somewhere and invalid elsewhere
Accelerator_RequireInvalidData: Tell Accelerator that certain grid functions will be accessed write-only somewhere
Accelerator_RequireValidData: Ask Accelerator to ensure that certain grid functions are valid somewhere (by copying data if necessary)
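The order in which a driver might call these functions can be sketched as below. This Python mock is purely illustrative: the real aliased functions take arguments that are omitted here, the helper names hook, call_scheduled_routine and evolve_one_step are hypothetical, and the placement of synchronisation directly after a scheduled routine is an assumption about the driver.

```python
calls = []


def hook(name):
    """Stand-in for an aliased function Accelerator_<name>; it only logs the call."""
    calls.append("Accelerator_" + name)


def call_scheduled_routine(routine, syncs_afterwards):
    hook("PreCallFunction")       # data made valid where the routine executes
    routine()
    hook("PostCallFunction")      # written data marked valid there
    if syncs_afterwards:
        hook("PreSync")
        calls.append("<driver synchronises on the host>")
        hook("PostSync")


def evolve_one_step(routines):
    hook("Cycle")                 # timelevels are cycled first
    for routine, syncs in routines:
        call_scheduled_routine(routine, syncs)


evolve_one_step([(lambda: calls.append("<rhs routine>"), False),
                 (lambda: calls.append("<update routine>"), True)])
print("\n".join(calls))
```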

Accelerator expects another thorn to provide the actual device-specific copy operations, e.g. for an OpenCL or CUDA device. These functions must be provided:

Device_CreateVariables: Create (i.e. begin to track) a set of variables/timelevels. This routine will be called exactly once before a variable/timelevel combination is mentioned to the device.
Device_CopyCycle: Cycle all timelevels on the device. (This routine should not block.)
Device_CopyFromPast: Copy past to current time level on the device. (This routine should not block.)
Device_CopyToDevice: Copy data from the host to the device. The return argument indicates whether the data have been copied or moved (i.e. are now invalid on the host). (This routine should not block.)
Device_CopyToHost: Copy data from the device back to the host. The return argument indicates whether the data have been copied or moved. (This routine should block until all data have been copied.)
Device_CopyPreSync: Copy those data from the device back to the host that will be needed for synchronisation (i.e. for inter-process synchronisation; AMR is not yet supported). (This routine should block until all data have been copied.)
Device_CopyPostSync: Copy the ghost zones to the device. (This routine should not block.)

The note “should not block” indicates that the respective data transfer operation should be performed in the background if possible; it is not necessary to wait until the data transfer has finished before returning. However, “should block” indicates that the routine must wait until all data have been transferred.
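The difference between the two kinds of routines can be illustrated with a small Python mock. This is only a sketch under stated assumptions: FakeDevice is hypothetical, and a single-threaded ThreadPoolExecutor stands in for an in-order device command queue such as one provided by OpenCL or CUDA.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeDevice:
    """Mock device whose transfers run on one in-order background queue."""

    def __init__(self):
        self._queue = ThreadPoolExecutor(max_workers=1)
        self.device_mem = {}
        self.host_mem = {}

    def copy_to_device(self, name, data):
        """'Should not block': enqueue the transfer and return at once."""
        self._queue.submit(self.device_mem.__setitem__, name, list(data))

    def copy_to_host(self, name):
        """'Should block': return only after the data have arrived on the host."""
        def transfer():
            self.host_mem[name] = list(self.device_mem[name])
        # The queue is in-order, so waiting for this transfer also waits for
        # every transfer that was enqueued before it.
        self._queue.submit(transfer).result()

    def close(self):
        self._queue.shutdown(wait=True)


dev = FakeDevice()
dev.copy_to_device("phi", [1.0, 2.0, 3.0])   # returns immediately
dev.copy_to_host("phi")                      # waits until the data are back
print(dev.host_mem["phi"])
dev.close()
```

Because the host reads the data right after a copy back, only the copies towards the host need to wait; copies towards the device can overlap with other host work.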

4 Restrictions

At the moment, Accelerator only supports unigrid simulations; adaptive mesh refinement or multi-block methods are not yet supported. The main reason for this is that it does not keep track of the additional metadata – this would be straightforward to add.

Accelerator is currently also tied to using Carpet as driver and does not work with other drivers such as PUGH. The main reason for this is that PUGH and the flesh do not provide the hooks necessary for Accelerator to work – these would also be straightforward to add.

5 Parameters


copy_back_all_timelevels	Scope: private	BOOLEAN

Description: Copy all timelevels back to the host for output

		Default: no


copy_back_all_written_variables_in_analysis	Scope: private	BOOLEAN

Description: Copy all variables that are written to in analysis back to the host

		Default: yes


copy_back_every	Scope: private	INT

Description: When to copy variables back to the host

Range		Default: 1
	never
1:*	every so many iterations


copy_back_vars	Scope: private	STRING

Description: Which variables to copy back

Range		Default: all
.*	list of group or variable names; empty to copy nothing, ’all’ to copy all variables


verbose	Scope: private	BOOLEAN

Description: Output detailed information

		Default: no


veryverbose	Scope: private	BOOLEAN

Description: Output even more detailed information

		Default: no


only_reads_current_timelevel	Scope: restricted	BOOLEAN

Description: Assume that functions read only the current timelevel

		Default: no


only_writes_current_timelevel	Scope: restricted	BOOLEAN

Description: Assume that functions write only the current timelevel

		Default: no

6 Interfaces

General

Implements:

accelerator

Uses header:

carpet.hh

Provides:

Accelerator_Cycle

Accelerator_CopyFromPast

Accelerator_PreSync

Accelerator_PostSync

Accelerator_PreCallFunction

Accelerator_PostCallFunction

Accelerator_NotifyDataModified

Accelerator_RequireInvalidData

Accelerator_RequireValidData

7 Schedule

This section lists all the variables which are assigned storage by thorn CactusUtils/Accelerator. Storage can either last for the duration of the run (Always means that if this thorn is activated storage will be assigned, Conditional means that if this thorn is activated storage will be assigned for the duration of the run if some condition is met), or can be turned on for the duration of a schedule function.

Storage

NONE

Scheduled Functions

CCTK_STARTUP

accelerator_init

initialise accelerator thorn

	Language:	c
	Type:	function

CCTK_ANALYSIS

accelerator_copyback

copy memory buffers back to host memory

	Language:	c
	Type:	function