## MemSpeed

June 22, 2013

Abstract

Determine the speed of the CPU, as well as the latencies and bandwidths of caches and main memory. These provide ideal, yet real-world, values against which the performance of other routines can be compared.

### 1 Measuring Maximum Speeds

This thorn measures the maximum practical speed that can be attained on a particular system. This speed will be somewhat lower than the theoretical peak performance listed in a system’s hardware description.

This thorn measures

• CPU floating-point performance (GFlop/s),
• CPU integer performance (GIop/s),
• Cache/memory read latency (ns),
• Cache/memory read bandwidth (GByte/s),
• Cache/memory write latency (ns),
• Cache/memory write bandwidth (GByte/s).

Theoretical performance values for memory access are often quoted in slightly diﬀerent units. For example, bandwidth is often measured in GT/s (Giga-Transactions per second), where a transaction transfers a certain number of bytes, usually a cache line (e.g. 64 bytes).
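
The conversion is simply the transaction rate times the transaction size; the numbers in the following example are illustrative only, not measurements:

$$\text{bandwidth} = \text{rate} \times \text{transaction size}, \qquad \text{e.g.}\quad 3.2\,\text{GT/s} \times 64\,\text{Byte} = 204.8\,\text{GByte/s}.$$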

A detailed understanding of the results requires some knowledge of how CPUs, caches, and memories operate. [1] provides a good introduction to this as well as to benchmark design. [2] is also a good read (by the same authors), and their (somewhat dated) software lmbench is available here [3].

### 2 Algorithms

We use the following algorithms to determine the maximum speeds. We believe these algorithms and their implementations are adequate for current architectures, but this may need to change in the future.

Each benchmark is run $N$ times, where $N$ is automatically chosen such that the total run time is larger than $1$ second. (If a benchmark ﬁnishes too quickly, then $N$ is increased and the benchmark is repeated.)
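
A minimal sketch of this auto-scaling repetition (the helper name `time_per_iteration` and the doubling of $N$ are illustrative assumptions, not the thorn's actual code):

```
#include <chrono>
#include <cstdint>

// Repeat a benchmark kernel, increasing N until the total run time
// exceeds 1 second, and return the time per iteration in seconds.
template <typename Kernel>
double time_per_iteration(Kernel run_kernel) {
  std::int64_t N = 1;
  for (;;) {
    auto t0 = std::chrono::steady_clock::now();
    for (std::int64_t i = 0; i < N; ++i) run_kernel();
    auto t1 = std::chrono::steady_clock::now();
    double elapsed = std::chrono::duration<double>(t1 - t0).count();
    if (elapsed > 1.0) return elapsed / N;  // ran long enough: report result
    N *= 2;                                 // finished too quickly: increase N
  }
}
```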

#### 2.1 CPU floating-point performance

CPUs for HPC systems are typically tuned for dot-product-like operations, where multiplications and additions alternate. We measure the floating-point performance with the following calculation:

```
for (int i=0; i<N; ++i) {
  s = c_1 * s + c_2;
}
```

where $s$ is suitably initialized and $c_1$ and $c_2$ are suitably chosen to avoid overflow, e.g. $s=1.0$, $c_1=1.1$, $c_2=-0.1$.

$s$ is a double precision variable. The loop over $i$ is explicitly unrolled $8$ times, is explicitly vectorized using LSUThorns/Vectors, and uses fma (fused multiply-add) instructions where available. This should ensure that the loop runs very close to the maximum possible speed. As usual, each (scalar) multiplication and addition is counted as one Flop (floating point operation).
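
As an illustration, a scalar (non-vectorized) version of this unrolled kernel might look as follows. Keeping eight independent accumulators is one plausible reading of "unrolled 8 times" and is an assumption here, as is the function name:

```
#include <cstdint>

// Multiply-add recurrence with eight independent accumulators, so that the
// dependent chains do not serialize.  (How the thorn actually arranges its
// unrolling and vectorization is not shown here.)
double flop_kernel(std::int64_t N) {
  const double c1 = 1.1, c2 = -0.1;  // chosen to avoid overflow
  double s0 = 1.0, s1 = 1.0, s2 = 1.0, s3 = 1.0,
         s4 = 1.0, s5 = 1.0, s6 = 1.0, s7 = 1.0;
  for (std::int64_t i = 0; i < N; i += 8) {
    s0 = c1 * s0 + c2;  s1 = c1 * s1 + c2;
    s2 = c1 * s2 + c2;  s3 = c1 * s3 + c2;
    s4 = c1 * s4 + c2;  s5 = c1 * s5 + c2;
    s6 = c1 * s6 + c2;  s7 = c1 * s7 + c2;
  }
  // Consume the results so the compiler cannot optimize the loop away.
  // The loop performs about 2*N Flops (one multiply and one add per step).
  return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7;
}
```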

#### 2.2 CPU integer performance

Many modern CPUs can handle integers in two different ways, treating integers either as data, or as pointers and array indices. For example, integers may be stored in two different sets of registers depending on their use. Here we are interested in the performance of pointers and array indices. Most modern CPUs cannot vectorize these operations (some GPUs can), and we therefore do not employ vectorization in this benchmark.

In general, array index calculations require addition and multiplication. For example, accessing the element $A(i,j)$ of a two-dimensional array requires calculating $i + n_i \cdot j$, where $n_i$ is the number of elements allocated in the $i$ direction.

However, general integer multiplications are expensive, and they are not necessary if the array is accessed in a loop, since a running index $p$ can be kept instead. Accessing neighbouring elements (e.g. in stencil calculations) then requires only additions and multiplications with small constants. In the example above, assuming that $p$ is the linear index corresponding to $A(i,j)$, accessing $A(i+1,j)$ requires calculating $p+1$, and accessing $A(i,j+2)$ requires calculating $p + 2 \cdot n_i$. We thus base our benchmark on integer additions and integer multiplications with small constants.
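
For illustration, such stencil-style accesses via a running linear index look as follows (the function and variable names are hypothetical, not taken from the thorn):

```
// Sum a few stencil-style neighbours of A(i,j), given the linear index p and
// the allocated row length ni; only additions and multiplications with small
// constants are needed.
double stencil_sum(const double* A, long p, long ni) {
  return A[p]            // A(i,   j  )
       + A[p + 1]        // A(i+1, j  )
       + A[p + 2*ni];    // A(i,   j+2)
}
```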

We measure the integer performance with the following calculation:

```
for (int i=0; i<N; ++i) {
  s = b + c * s;
}
```

where $b$ is a constant deﬁned at run time, and $c$ is a small integer constant ($c=1\dots 8$) known at compile time.

$s$ is an integer variable of the same size as a pointer, i.e. 64 bit on a 64-bit system. The loop over $i$ is explicitly unrolled $8$ times, each time with a diﬀerent value for $c$. Each addition and multiplication is counted as one Iop (integer operation).
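
A sketch of this unrolled integer kernel under these assumptions (constants $c=1\dots 8$, pointer-sized unsigned arithmetic; the function name is made up for illustration):

```
#include <cstdint>

// Integer benchmark kernel: pointer-sized additions and multiplications with
// small compile-time constants.  Each unrolled line uses a different small
// constant c; b is only known at run time.  Unsigned wrap-around on overflow
// is well defined and harmless for timing purposes.
std::uintptr_t iop_kernel(std::int64_t N, std::uintptr_t b) {
  std::uintptr_t s = 1;
  for (std::int64_t i = 0; i < N; i += 8) {
    s = b + 1 * s;  s = b + 2 * s;
    s = b + 3 * s;  s = b + 4 * s;
    s = b + 5 * s;  s = b + 6 * s;
    s = b + 7 * s;  s = b + 8 * s;
  }
  return s;  // consume the result so the loop is not optimized away
}
```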

#### 2.3 Cache/memory read latency

Memory read access latency is measured by reading small amounts of data from random locations in memory. This random access pattern defeats caches (and hardware prefetchers), since random accesses exhibit no locality that could be exploited. To ensure that the read operations are executed sequentially, each read operation needs to depend on the previous one. The idea for the algorithm below was taken from [1].

To implement this, we set up a large linked list whose elements are randomly ordered. Traversing this linked list then measures the memory read latency. This is done as in the following pseudo-code:

```
struct L { L* next; };
// ... set up a large circular list with randomly ordered elements;
//     ptr points to one of its nodes ...
for (int i=0; i<N; ++i) {
  ptr = ptr->next;
}
```

To reduce the overhead of the for loop, we explicitly unroll the loop 100 times.
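
One way to set up such a randomly ordered circular list is to chain the nodes along a shuffled permutation. The following is an illustrative sketch (node type and helper name chosen here), not the thorn's actual setup code:

```
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

struct L { L* next; };

// Link all nodes into a single circular list whose traversal order is a
// random permutation, so that each pointer chase jumps to an unpredictable
// memory location.
void make_random_cycle(std::vector<L>& nodes) {
  std::vector<std::size_t> order(nodes.size());
  std::iota(order.begin(), order.end(), 0);
  std::mt19937_64 rng(12345);
  std::shuffle(order.begin() + 1, order.end(), rng);  // keep node 0 first
  for (std::size_t k = 0; k < order.size(); ++k)
    nodes[order[k]].next = &nodes[order[(k + 1) % order.size()]];
}
```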

We use the hwloc library to determine the sizes of the available data caches, of the NUMA-node-local memory, and of the global (total) memory. We perform this benchmark once for each cache level, and once each for the local and the global memory:

• for a cache, the list occupies 3/4 of the cache;
• for the local memory, the list occupies 1/2 of the memory;
• for the global memory, the list skips the local memory, and occupies 1/4 of the remaining global memory.

To skip the local memory, we allocate an array of the size of the local memory. Assuming that the operating system prefers to allocate local memory, this ensures that all further allocations use non-local memory. We do not test this assumption.
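
The resulting working-set sizes can be computed as in the following sketch; the function and parameter names are made up for illustration, while the fractions are those listed above:

```
#include <cstddef>

enum class Level { Cache, LocalMemory, GlobalMemory };

// Working-set size in bytes for a given level, based on the sizes reported
// by hwloc.  For the global memory, local_memory_size is subtracted first
// because the local memory has already been "skipped" by a prior allocation.
std::size_t working_set_bytes(Level level, std::size_t level_size,
                              std::size_t local_memory_size = 0) {
  switch (level) {
    case Level::Cache:        return level_size * 3 / 4;  // 3/4 of the cache
    case Level::LocalMemory:  return level_size / 2;      // 1/2 of local memory
    case Level::GlobalMemory: return (level_size - local_memory_size) / 4;
  }
  return 0;
}
```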

#### 2.4 Cache/memory read bandwidth

Memory read access bandwidth is measured by reading a large, contiguous amount of data from memory. This access pattern benefits from caches (if the amount is less than the cache size), and also benefits from prefetching (which may be performed either by the compiler or by the hardware). This thus presents an ideal case where memory is read as fast as possible.

To ensure that data are actually read from memory, it is necessary to consume the data, i.e. to perform some operations on them. We assume that a ﬂoating-point dot-product is among the fastest operations, and thus use the following algorithm:

```
for (int i=0; i<N; i+=2) {
  s = m[i] * s + m[i+1];
}
```

As in section 2.1 above, $s$ is a double precision variable. $m[i]$ denotes the memory accesses. The loop over $i$ is explicitly unrolled $8$ times, is explicitly vectorized using LSUThorns/Vectors, and uses fma (fused multiply-add) instructions where available. This should ensure that the loop runs very close to the maximum possible speed.
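
A scalar sketch of this read kernel, together with the bandwidth calculation, might look as follows (without the LSUThorns/Vectors vectorization and unrolling; the helper name and single-pass timing are assumptions):

```
#include <chrono>
#include <cstddef>
#include <vector>

// Read the array once with a multiply-add reduction and report the
// sustained read bandwidth in GByte/s.
double read_bandwidth_gbps(const std::vector<double>& m) {
  double s = 1.0;
  auto t0 = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i + 1 < m.size(); i += 2)
    s = m[i] * s + m[i + 1];
  auto t1 = std::chrono::steady_clock::now();
  double seconds = std::chrono::duration<double>(t1 - t0).count();
  volatile double sink = s;  // consume s so the loop is not optimized away
  (void)sink;
  double bytes = static_cast<double>(m.size()) * sizeof(double);
  return bytes / seconds / 1.0e9;
}
```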

To measure the bandwidth of each cache level as well as the local and global memory, the same array sizes as in section 2.3 are used.

#### 2.5 Cache/memory write latency

The notion of a “write latency” does not really make sense, as write operations to different memory locations do not depend on each other. This benchmark therefore rather measures the rate at which independent write requests can be handled. However, since writing a partial cache line also requires reading it, this benchmark is also influenced by read performance.

To measure the write latency, we use the following algorithm, writing a single byte to random locations in memory:

```
char array[N];
char* ptr = ...;   // some starting location within array
for (int i=0; i<N; ++i) {
  *ptr = 1;
  ptr += ...;      // pseudo-random increment, keeping ptr within array
}
```

In the loop, the pointer is increased by a pseudo-random amount, while ensuring that it stays within the bounds of the array.

To measure the write latency for each cache level as well as for the local and global memory, the same array sizes as in section 2.3 are used as a starting point. For efficiency reasons, these sizes are then rounded down to the nearest power of two.
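
Rounding the array size down to a power of two allows the wrap-around to be a cheap bit mask. A sketch of this idea follows; the LCG constants and the helper name are illustrative assumptions, not the thorn's actual code:

```
#include <cstddef>
#include <cstdint>
#include <vector>

// Write a single byte to pseudo-random locations of a buffer whose size is a
// power of two, so that "stay within the array" is a simple bit mask.
void write_latency_kernel(std::vector<char>& array, std::int64_t N) {
  const std::size_t mask = array.size() - 1;  // array.size() is a power of two
  std::uint64_t state = 1;
  std::size_t offset = 0;
  for (std::int64_t i = 0; i < N; ++i) {
    array[offset] = 1;
    // Simple linear congruential generator for the next pseudo-random step.
    state = state * 6364136223846793005ULL + 1442695040888963407ULL;
    offset = (offset + state) & mask;
  }
}
```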

#### 2.6 Cache/memory write bandwidth

Memory write access bandwidth is measured by writing a large, contiguous amount of data to memory, in a manner very similar to measuring the read bandwidth. The major difference is that the written data do not need to be consumed by the CPU, which simplifies the implementation.

We use memset to write data into an array, assuming that the memset function is already heavily optimized.
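
A minimal sketch of the memset-based measurement (the helper name and the single-pass timing are assumptions for illustration):

```
#include <chrono>
#include <cstring>
#include <vector>

// Fill the array once with memset and report the write bandwidth in GByte/s.
double write_bandwidth_gbps(std::vector<char>& array) {
  auto t0 = std::chrono::steady_clock::now();
  std::memset(array.data(), 0, array.size());
  auto t1 = std::chrono::steady_clock::now();
  double seconds = std::chrono::duration<double>(t1 - t0).count();
  return static_cast<double>(array.size()) / seconds / 1.0e9;
}
```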

To measure the bandwidth of each cache level as well as the local and global memory, the same array sizes as in section 2.3 are used.

### 3 Caveats

This benchmark should work out of the box on all systems.

The only major caveat is that it allocates more than half of the system’s memory for its benchmarks, which can severely degrade system performance if run on an interactive system (laptop or workstation). If run with MPI, only the root process runs the benchmark.

Typical memory bandwidth numbers are in the range of multiple GByte/s. Given today’s memory amounts of many GByte, this means that this benchmark runs for tens of seconds. In addition to benchmarking memory access, the operating system also needs to allocate the memory, which is surprisingly slow. A typical total execution time is several minutes.

### 4 Example Results

The XSEDE system Kraken at NICS reports the following performance numbers (measured on June 21, 2013):

```
INFO (MemSpeed): Measuring CPU, cache, and memory speeds:
CPU floating point performance: 10.396 Gflop/sec for each PU
CPU integer performance: 6.23736 Giop/sec for each PU
D1 cache read latency: 1.15434 nsec
L2 cache read latency: 5.82695 nsec
L3 cache read latency: 29.4962 nsec
local memory read latency: 135.264 nsec
global memory read latency: 154.1 nsec
D1 cache read bandwidth: 72.3597 GByte/sec for 1 PUs
L2 cache read bandwidth: 20.7431 GByte/sec for 1 PUs
L3 cache read bandwidth: 9.51587 GByte/sec for 6 PUs
local memory read bandwidth: 5.19518 GByte/sec for 6 PUs
global memory read bandwidth: 4.03817 GByte/sec for 12 PUs
Write latency:
D1 cache write latency: 0.24048 nsec
L2 cache write latency: 2.8294 nsec
L3 cache write latency: 9.32924 nsec
local memory write latency: 47.5912 nsec
global memory write latency: 58.1591 nsec
Write bandwidth:
D1 cache write bandwidth: 39.3172 GByte/sec for 1 PUs
L2 cache write bandwidth: 12.9614 GByte/sec for 1 PUs
L3 cache write bandwidth: 5.5553 GByte/sec for 6 PUs
local memory write bandwidth: 4.48227 GByte/sec for 6 PUs
global memory write bandwidth: 3.16998 GByte/sec for 12 PUs
```

The XSEDE system Kraken at NICS also reports the following system conﬁguration via thorn hwloc (reported on June 21, 2013):

```
INFO (hwloc): Extracting CPU/cache/memory properties:
There are 1 PUs per core (aka hardware SMT threads)
Cache (unknown name) has type "data" depth 1
size 65536 linesize 64 associativity 2 stride 32768, for 1 PUs
Cache (unknown name) has type "unified" depth 2
size 524288 linesize 64 associativity 16 stride 32768, for 1 PUs
Cache (unknown name) has type "unified" depth 3
size 6291456 linesize 64 associativity 48 stride 131072, for 6 PUs
Memory has type "local" depth 1
size 8589541376 pagesize 4096, for 6 PUs
Memory has type "global" depth 1
size 17179082752 pagesize 4096, for 12 PUs
```

Kraken’s CPUs identify themselves as 6-Core AMD Opteron(tm) Processor 23 (D0).

Let us examine and partially interpret these numbers. (While the particular results will be diﬀerent for other systems, the general behaviour will often be similar.)

Kraken’s compute nodes run at 2.6 GHz and execute 4 Flop/cycle, leading to a theoretical peak performance of 10.4 GFlop/s. Our measured number of 10.396 GFlop/s is surprisingly close.

For integer performance, we expect half of the floating-point performance, since we cannot make use of vectorization, which yields a factor of two on this architecture; this corresponds to about 5.2 GIop/s. The reported number of 6.2 GIop/s is somewhat larger. We assume that the compiler found some way to optimize the code that we did not foresee, i.e. that this benchmark is not optimally designed. Still, the results are close.

The read latency of the D1 cache is difficult to measure exactly here, since it is so fast; the cycle time, and thus the natural uncertainty, is about 0.38 ns. (We assume this could be measured accurately with sufficient effort, but we do not completely trust our benchmark algorithm.) We thus conclude that the D1 cache has a latency of about 1 ns or less. A similar argument holds for the D1 read bandwidth; we conclude that the true bandwidth is 72 GB/s or higher.

The L2 cache has a higher read latency and a lower read bandwidth (it is also signiﬁcantly larger than the D1 cache). We consider these performance numbers now to be trustworthy.

The L3 cache again has a slower read performance than the L2 cache. The major difference is that the L3 cache is shared between six cores, so that its bandwidth will be shared if several cores access it simultaneously.

The local memory has a slightly lower read bandwidth and a significantly higher read latency than the L3 cache. The global memory is measurably slower than the local memory, but not by a large margin.