These html pages are based on the PhD thesis "Cluster-Based Parallelization of Simulations on Dynamically Adaptive Grids and Dynamic Resource Management" by Martin Schreiber.
There is also more information and a PDF version available.

4.10 Optimization

So far, we presented several framework requirements and how they can be accomplished with a serial traversal of the Sierpiński SFC. This section presents several performance improvements with hardware optimizations and algorithmic developments.

4.10.1 Parameter unrolling

Stack push and pop operations depend on the edge (Section 4.3.3) or vertex types (Section 4.3.4) as well as the order in which they are accessed (Section 4.3.2). A straightforward implementation would lead to inheritance of such types and utilizations of if-branchings for distinguishing between whether a push or pop operation has to be done as well as on which stack such an operation has to be executed.

We only consider new and old edge types without loss of generality and start by describing the issues with the non-optimized version: For different if-branches used to distinguish between these two edge types, a typical branch prediction would lead to a miss rate of 50% in average. We expect this to lead to less performance due to discarding instructions in the pipeline and thus reduced efficiency for the grid traversal.

With our code generator (Section 4.9.5), we can consider the recursive spacetree traversal code be given by

S := foobar(P1,P2, ...,Pn, ...){Source(P1,P2,...,Pn,...)}
With Source() representing a source code with Pi as parameters. Here, the considered parameters Pi are within a relatively small parameter range Pi (pi1,,pini) with ni the number of possible parameters for the i-th parameter. Such a parameter range can e.g. be the edge types {new,old,boundary}, the orientation, etc. With the traversal source code, we can create specialized code for each state which we further refer to as parameter unrolling. This yields iPi possible specializations of the form
SP xPx...xP (...){SourceP  ,P ,...,P (...)}
  1  2   n              1 2   n
which on the first glimpse seems to be a drawback due to increased lines of code and hence increased code size. However, this approach allows avoiding all if-branchings required by different parameters Pi, thus avoiding if-branch mispredictions induced by new and old edge type labels.

We compared grid traversals with and without parameter unrolling on an Intel(R) Xeon(R) CPU X5690 with 3.47GHz. A regular grid resolution with 22 refinement levels is used with 12 floating point operations in each cell. The vertices are computed during the grid traversal on-the-fly.

On the one hand, this leads to increased size of machine code and therefore a severely increased instruction cache miss rate by a factor of 33.7. On the other hand, the if-branching mispredictions are reduced by a factor of 4.7 in average. Having a look at concrete runtime results shows that the avoidance of if-branching mispredictions outperforms the increased instruction cache misses:

parameter unrolling with parameters performance increase

GNU Compiler 5.066 sec 6.044 sec 19.3%

Intel Compiler 3.806 sec 5.130 sec 34.8%

Hence, such an optimization is important for kernels with only a few operations.

4.10.2 Recursive grid traversal and inlining

Due to the recursive grid traversal, we should be aware of the overhead of recursive methods. This overhead involves calls of our methods as well as the parameter handling of the methods. Therefore, kernels and traversal methods are prefixed with the inline statement, instructing the compiler not to use a function call to this method but to inline the code of the method directly. This aims at avoiding function calls. Similar inline optimizations are also used for the stack and other core operations.

4.10.3 Adaptivity automaton

With the adaptivity traversals presented in Section 4.6, a single traversal only considers the change of adaptivity state, but not the direction of propagation of the adaptivity markers.

In case that a refinement marker is forwarded via an edge and the traversal direction propagates this information to the adjacent cell within the same traversal, this assures the processing of the refinement marker within the same traversal. Thus forwarding and processing the refinement marker in the same traversal is assured for edges of type new. If all refinement markers can be forwarded in the traversal, also the grid is assumed to be in conforming state.

incoming edge marker
state t 000 001 010 011 100 101 110 111

no request 0 000,0 100,5 100,6 100,7 000,4 000,5 000,6 000,7

INVALID 1 000,1 000,1 000,1 000,1 000,1 000,1 000,1 000,1

local coarsening req. 2 000,2 100,5 100,6 100,7 000,4 000,5 000,6 000,7

local refine request 3 100,4 100,5 100,6 100,7 000,4 000,5 000,6 000,7

refined: hyp 4 000,4 000,5 000,6 000,7 000,4 000,5 000,6 000,7

refined: hyp, left 5 000,5 000,5 000,7 000,7 000,5 000,5 000,7 000,7

refined: hyp, right 6 000,6 000,7 000,6 000,7 000,6 000,7 000,6 000,7

ref.: hyp, right, left 7 000,7 000,7 000,7 000,7 000,7 000,7 000,7 000,7

Table 4.1: Transistion table for our adaptivity automaton. Each row represents the current transition state t and each column the incoming state change request based on the incoming adaptivity markers MR (bit is set) and M0 (bit is unset). The new transition state (f,t) is then given by the transition to the adaptivity state t and bitencoded information on edges for which edge markers MR still have to be forwarded to generate a conforming grid.

We construct an automaton with initial states given by the adaptivity states, transition states based on the adaptivity markers read via edge communication and remaining adaptivity markers to be forwarded. This automaton can be written in a tabular format (see Table 4.1) and is used as follows.

Each grid cell has its own adaptivity state S := (f,t) which consists out of the refinement information which still has to be forwarded via each edge in a bit field f := {0,1}3 for edges (e1,e2,e3); t represents one of the adaptivity states from (0) to (7).

Considering refinement operations only, the transition function is then based on a table lookup using the current adaptivity state f selecting the row and the input bit fields storing the incoming edge markers selecting the column in Table 4.1. The new state of a single cell is then given by the entry (f,t) of the table cell. Bits set in fupdate represents the required refinement information which still has to be forwarded via edges.

Initialization: For initialization, the adaptivity state t is set to (0) for no adaptivity request, (2) for a local coarsening request and (3) for a local refinement request. The forward information bits f are set to (100) for state (3) avoiding the hanging node, otherwise to (000). The transition state tuple (f,t) is then stored to the adaptivity state stack and updated during the following grid traversals to generate a conforming grid state.
Transition: The transition is based on the incoming edge markers for refinement requests which we store to the tuple T := (e1,e2,e3).

Based on the current transition state (f,t), the tuple of the refinement requests t to be forwarded to adjacent cells and the new transition state can be determined via a lookup in the automaton table with the current incoming refinement requests T, yielding (fupdate,tnew).

In case of a cell transition set to coarsening (2), the propagation of the coarsening agreements marker MC is immediately stopped in case of an incoming marker fetched via a cathetus is not of type MC.

Forwarding of refinement information: Since the information to be forwarded is already included in f, required refinement markers are also stored in fupdate. A bitmask is used with bits set for edges of type old. This bitmask is ANDed to f to account for edges of type (a) new with refinement information propagated during the current cell traversal and (b) boundary which do not require forwarding of refinement information.

This adaptivity automaton is able to avoid at most one traversal for serial traversals. However it turns into a crucial component for the optimization with the cluster skipping approach in the parallelization (See Section 5.8.2).

4.10.4 CPU SIMD optimizations for inter-cell computations (fluxes)

With the trend of computing cores requiring operations on data stored in vectors to get close to the peak of floating point operations, utilization of vector instructions is mandatory for HPC-oriented developments.

Regarding the flux computations on the edges for higher-order DG simulations, multiple nodal points for flux evaluation are available on each edge. This allows a straightforward optimization with SIMD operations.

For 0th-th order discretization, however, only DoF of a single node on the edge is stored per edge. Here, SIMD optimizations are still possible for a pair of nodal points e.g. by computing multiple square roots in the Rusanov solver with a single SIMD operation.

In this section, we like to focus on an alternative method for 0th-th order discretizations. We consider, that the nodal-wise given conserved quantities are stored consecutively on the edge buffer stacks. 4 Since the DoF on the edge buffer stack are stored in the format of arrays of structure and since SIMD operations typically demand for a structure of arrays, the data has to be rearranged. We consider a block of multiple conserved quantities Ui+,Ui- for the left and right state of an edge stored to the edge buffer stack:

        +   -   +   -   +   -   +   -
S   = (Ue1,Ue1,Ue2,Ue2,Ue3,Ue3,Ue4,Ue4)                   (4.21)
    = ((q+e11 ,q+e21 ,q+e13,q+e14),(q-e11,qe-12,q-e13,q-e14),               (4.22)
        +1  +2  +3  +4    -1  - 2 - 3 - 4
      (qe2 ,qe2 ,qe2 ,qe2 ),(qe2 ,qe2 ,qe2 ,qe2 ),            (4.23)
      (q+e13 ,q+e23 ,q+e33 ,q+e43 ),(q-e13 ,q-e32 ,q-e33,q-e34),             (4.24)
      (q+1,q+2,q+3 ,q+4 ),(q-1,q- 2,q- 3,q- 4)).               (4.25)
        e4  e4  e4  e4    e4  e4  e4  e4
Assuming that this is an 8 × 4 matrix, we can transpose this matrix. This yields four components of the same conserved quantity on either the left or the right side of the edge in each vector:
          +1  +1  +1  +1
S   =  ((qe1 ,qe2 ,qe3 ,qe4 ),                      (4.26)
        (q+e12,q+e22,q+e23 ,q+e24 ),                       (4.27)
        (q+3,q+3,q+3,q+3),                         (4.28)
          e1+4  e2+4  e+34  e+44
        (qe1 ,qe2 ,qe3 ,qe4 ),                       (4.29)
        (q-e11,q-e21,q-e13 ,q-e14 ),                       (4.30)
          -2  -2  -2  -2
        (qe1 ,qe2 ,qe3 ,qe4 ),                       (4.31)
        (q-e13,q-e23,q-e33 ,q-e34 ),                       (4.32)
        (q-4,q-4,q-4,q-4)).                        (4.33)
          e1   e2   e3  e4
With a typical flux solver such as the Rusanov flux solver, an optimized SIMD evaluation of the conserved quantities is then possible. After flux computations, the data layout has to be converted back from the structure-of-arrays to the arrays-of-structure format. Such a reordering in its generic form is also known as gather and scatter operations, respectively, for packing data into a vector-processable format and inverting this packaging (cf.  [EHB+13]). In our case, a matrix transposition was sufficient.

As a proof of concept, we conducted benchmarks based on the augmented Riemann solver [Geo06] which were SIMD optimized in [Höl13]. In this work, the SIMD-optimized augmented Riemann solver is implemented with single precision. We compare the SIMD performance improvements with this solver based on simulations with a radial dam break szenario. The simulation is executed with an initial refinement depth of d = 16 and with a = 10 additional levels of adaptive grid refinement. The SIMD optimization which we used for the benchmarks stores 4 single-precision components in each vector, and we execute the simulation for 100 time steps.

We measure the optimization for the edge communication and adaptivity traversal times given in seconds:

SIMD disabled SIMD enabled

Time step 10.86 6.66

Adaptivity 3.99 3.93

The time to compute a time step is improved by 38.7% but the maximal theoretical improvement of 75% was not reached. We account for that by (a) the branch divergence inside the flux solver resulting in non-parallalelized sections and (b) the grid traversal time which was not optimized with the SIMD flux evaluation. The runtime for the adaptivity traversals is not reduced since the SIMD optimized solvers are only used in the time step traversals.

4.10.5 Structure of arrays for cell-local computations

Considering the flux computations of conserved quantities with higher-order basis functions (see Section 2.3), we can use SIMD instructions by storing the weights for the basis functions in a structure of arrays. We store the n weights qji with j ∈{1,,n} for a particular conserved quantity i of a cell consecutively in memory. For four conserved quantities, this yields

(                                             )
 q11,q12,...,q1n,  q21,q22,...,q2n,  ...  q41,q42,...,q4n .
Using this structure of arrays with the matrix formulation of the DG time stepping scheme (see Part II), allows us to compute the time step integration mainly with matrix-vector or matrix-matrix multiplications.

In this work, we followed this structure-of-arrays format for our conserved quantities, however, we did not further focus on optimization of such matrix-matrix computations.

4.10.6 Prospective stack allocation

A straightforward implementation of the stack system with the C++ std::vector class from the standard library would lead to a performance slowdown: an obvious example is given by frequent memory free() and alloc() operations if not pre-allocating sufficient memory and thus re-utilization of already allocated memory. Pre-allocating far more memory than required leads to a severe additional memory consumption, possibly exceeding the available memory.

With a prospective stack size allocation based on an upper limit for the maximum required allocated stack, we can overcome testing for exceeding stack access and avoid any stack resize operations during grid traversals. Using the knowledge on the number of cells c (See [Vig12]) for our simulation domain assembled by a triangle, upper limits for the storage requirements on all stacks can be determined. The number of cells c after all adaptivity traversals on a grid can be derived based on the adaptivity state information before updating the grid structure. The number of grid cells are increased for each inserted edge and decreased for a pair of coarsening requests stored on the adaptivity state stack. See e.g. Fig. 4.8 with the red-dashed lines representing the inserted edges for the refinement operations (3) to (7).

Here, we infer the upper limits of the required stack storage:

Simulation cell data SsimCellData:

The cell data stack size is directly set to c.

Structure stack Sstructure:

An upper limit of the size of the structure stack is directly given by the recursive structure. With a cell split achieved by replacing 0 with 100 on the structure stack, this creates 2 additional bits for each new cell. Starting with a single cell represented by a stack storing |0 only, i.e. only one entry, this yields

max (|Sstructure|) := 1+ 2(c - 1) = 2c - 1
Edge communication data SsimEdgeComm, SadaptiveEdgeComm:

Edge communication data such as SsimEdgeComm for running the simulation or SadaptiveEdgeComm for adaptivity traversals is limited by recursive bisection of the communication edges. We follow a greedy approach successively splitting triangles targeting at creating as many edges as possible at the triangle boundary at the root spacetree node. This leads to a sequence of maximum edge communication elements E with Ei representing the upper limit of required stack storage for the i-th inserted edge. For the right communication stack, this yields Eright := (1,2,2,3,3,4,4,5,5,6,6,) and Eleft := (2,2,3,4,4,5,5,6,6,7,7,) for the left communication stack. With Ei := max(Eileft,Eiright) = Eileft, an upper limit is given by

      {sim,adaptive}EdgeComm            c
max (|S                    |) := floor(2)+ 2.
By splitting the root triangle into two children and considering the maximum number of communication edges between both children, the cardinality is still less or equal to floor(c
2) + 2.
Edge buffer data SsimEdgeBuffer:

For the root triangle, at least six edge buffer elements are required due to storing up to six boundary edge data elements on the buffer stacks. With each refinement of child triangles, two additional edges are created for the shared edge and two additional edges are created due to splitting the edge on the hypotenuse. This demands for additional storage of four elements on the stack system. Thus, the maximum number of required edge buffer data for edge data is

max (|SsimEdgeBuffer|) := 6 + 4(c- 1) = 2+ 4 c.
Vertex communication data SvertexComm:

For vertex data, only two nodes can be stored on each communication stack. Following a greedy edge insertion focussing on creating as many vertices as possible on the boundary of the domain created by the spacetree, this yields

vleft := (3,3,4,5,5,6,6,7,7,...)
v    := (2,3,3,4,4,5,5,6,6,...)
With max(vileft,viright) = vileft, an upper limit is given by
max (|SvertexComm |) := floor(c)+ 3 .
Vertex data buffer SvertexBuffer:

For the visualization, the vertex buffer only requires data storage for the vertices inside the domain since the boundary vertex data is stored to the vertex communication stacks. Each insertion of an edge creates one additional vertex. Since those inner vertices can be created only with more than 4 cells and one additional vertex is generated for each new cell, this yields:

max (|S          |) := max (c- 4,0)
We like to emphasize, that e.g. a node-based communication for flux limiters of DG simulations can lead to other limits.
Resizing stacks:

We derived maximum stack size capacities to store all data during a traversal based on a given number of simulation cells c. This new snumber of simulation cells was derived with the adaptivity state flags on the stack after the conforming adaptivity state was detected.

Before the last backward adaptivity traversal, we then reallocate the output stacks with the new capacity. The other stacks are reallocated after the last backward adaptivity traversal. This avoids resize operations during all traversals.

However this would still yield a frequent reallocation of stack data if the number of cells changes. Therefore, we introduce two additional padding values: Ppadding and Pundershoot. These thresholds are applied in case of exceeding and undershooting a particular number of simulation cells.

A reallocation is done only if |SsimCellData| < c or |SsimCellData|-Pundershoot > c. This reallocates the stacks for c + Ppadding simulation cells.