
5.8 Cluster-based optimization

We continue with an overview of possible optimizations.

5.8.1 Single reduce operation for replicated data

Using the replicated communication scheme (see Section 5.2), there are basically two ways to reduce the replicated data: either both adjacent clusters execute the reduce operation on their own copy of the shared interface data (duplicated reduction), or the reduce operation is executed only once and its result is forwarded to the adjacent cluster (single reduction).

We conducted two experiments based on a shallow water simulation with a radial dam break, 18 initial refinement levels and up to 8 additional levels of adaptivity. Since we assume that the benefits of avoiding duplicated reduce operations depend on the costs of the flux solvers, we ran two benchmarks, each with a different flux solver: the relatively lightweight Rusanov solver and the heavyweight augmented Riemann solver from the GeoClaw package [Geo08].

The saved computations also depend on the number of shared cluster interfaces; hence we executed benchmarks parametrized with different split thresholds. The results, based on a single-core execution, are given in Fig. 5.28 and show that the possible performance improvement depends on the flux solver used.



Figure 5.28: Performance speedup and slowdown when avoiding duplicated reduction of replicated data.

1. For the lightweight flux solver, avoiding duplicated reduce operations results in a performance decrease for smaller clusters. We explain this effect by the threading overhead: the additional synchronization requires executing additional tasks, and for the Rusanov flux solver this synchronization overhead outweighs the benefits.
2. For heavyweight solvers, we see a speedup with the optimization and assume that the computational intensity of the solver outweighs the tasking overhead of the additional synchronization.

To verify these assumptions, we develop a simplified model based on the following quantities: the overhead o for the additional synchronization, the computational costs c per cell, the computational costs e per edge, and the refinement depth d of a cluster, yielding 2^d cells.

Since the computational speedup can be computed for each cluster separately, our model only considers the computational costs of a single cluster. The costs for a cluster with a single reduce operation are then given by

\[
C^{(\text{single})}(o,e,c,d) \;:=\; \underbrace{o}_{\text{additional synch. overhead}} \;+\; \underbrace{2^d \cdot c}_{\text{computations for cells}} \;+\; \underbrace{1.5 \cdot 2^d \cdot e}_{\text{computations for edges}}
\]
whereas the costs with two reduce operations per shared hyperface are given by
\[
C^{(\text{double})}(o,e,c,d) \;:=\; \underbrace{2^d \cdot c}_{\text{computations for cells}} \;+\; \underbrace{1.5 \cdot 2^d \cdot e}_{\text{computations for edges}} \;+\; \underbrace{2^{d/2} \cdot e}_{\text{additional comp. for edges}}.
\]


Figure 5.29: Modeled performance speedup and slowdown when avoiding duplicated reduction of replicated data, for different refinement depths and computational edge costs.

We compute the speedup as
\[
\frac{C^{(\text{double})} - C^{(\text{single})}}{C^{(\text{double})}}
\]
and use the rule-of-thumb values o = 1000 (e.g. 1000 cycles to dequeue a task from the worker's queue) and c = 100 as the computational workload per cell. The speedup graph for different refinement depths, and thus cluster sizes, is given in Fig. 5.29 for different computational costs per edge.
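To make the model concrete, the following short Python sketch evaluates the cost formulas above; the rule-of-thumb values o and c are taken from the text, while the sampled edge costs e and depths d are illustrative assumptions:

```python
# Illustrative evaluation of the per-cluster cost model (Sec. 5.8.1).

def c_single(o, e, c, d):
    """Costs with a single reduce operation per shared hyperface."""
    return o + 2**d * c + 1.5 * 2**d * e

def c_double(o, e, c, d):
    """Costs with two reduce operations per shared hyperface."""
    return 2**d * c + 1.5 * 2**d * e + 2**(d / 2) * e

o, c = 1000, 100            # synch. overhead and per-cell workload (rule of thumb)
for e in (10, 100, 1000):   # lightweight ... heavyweight edge computations
    for d in (4, 8, 12, 16):  # refinement depth; 2^d cells per cluster
        s = (c_double(o, e, c, d) - c_single(o, e, c, d)) / c_double(o, e, c, d)
        print(f"e={e:4d}  d={d:2d}  speedup={s:+.3f}")
```

Negative values correspond to a slowdown: for cheap edge computations the saved reductions do not amortize the synchronization overhead o, while for expensive edge computations a speedup remains that vanishes with growing cluster size.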

The shapes of our experimentally determined results in Fig. 5.28 match our model. Both agree that, in case of a low computational cost per edge, there is no benefit in using a single reduce operation. For computationally intensive edge reductions, as is the case for the augmented Riemann solver, using a single reduce operation can be beneficial. However, as the number of cells in each cluster grows, the benefits tend towards zero.

We conclude that avoiding duplicated reduce operations can be beneficial for small clusters, e.g. as required for strong-scaling problems and for the algorithm presented in the next section.

5.8.2 Skipping of traversals on clusters with a conforming state

So far, we only considered adaptivity conformity traversals executed over the entire domain grid. For parallelization, we decomposed our domain into clusters with a replicated data scheme to execute traversals massively in parallel, also in a shared-memory environment. For adaptivity traversals aiming at a conforming grid, the cluster approach is similar to a red-black coloring approach: information on hanging nodes is possibly not forwarded to adjacent clusters within the current adaptivity traversal. Thus, more traversals can be required in case of parallelization.

However, the clustering approach offers a way to compensate for these conformity traversals if there is more than one cluster per compute unit: adaptivity traversals can be skipped on clusters that (a) already have a conforming grid and (b) do not receive a request from an adjacent cluster to insert an edge (adaptivity marker MR). Both requirements are discussed next:

(a)
To determine whether the adaptivity states already lead to a conforming cluster, we tag each cluster with a “non-conforming” boolean which is set to false before any adaptivity traversal. During the adaptivity traversals, as soon as information on inserting an edge is not propagated within the current traversal, this boolean is set to true. Such information on non-propagated hanging nodes is directly given by the forward information bits f (see Section 4.10.3): if these bits are not equal to zero, adaptivity information still has to be propagated by an additional traversal.
(b)
Testing for adaptivity request propagations from adjacent clusters is accomplished by checking the left and right adaptivity communication buffers stored at the adjacent cluster, which store adaptivity markers MR for corresponding requests. If none of these values is set to MR, no adaptivity edge-insertion requests are propagated from the adjacent cluster.

This algorithm is based on the sparse graph representation (see Fig. 5.5) of our clustered domain: using this graph representation, our algorithm can be represented by a Petri net [SWB13a].

Algorithm: Skipping of clusters with a conforming adaptivity state

Here, each cluster is a node and a directed arc exists for each RLE entry. Tokens represent requests for additional conformity traversals. We then distinguish between local consistency traversal tokens and adjacent consistency tokens, with the latter forwarded to, or received from, adjacent clusters.
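The following Python sketch illustrates the skipping test of (a) and (b) and the resulting traversal loop. It is an illustration only: the names non_conforming, adjacent_buffers and run_adaptivity_traversal are hypothetical and do not reflect the thesis implementation.

```python
MR = "MR"  # adaptivity marker requesting an edge insertion

def needs_adaptivity_traversal(cluster):
    """Return True if another conformity traversal must run on this cluster."""
    # (a) The previous traversal left non-propagated hanging nodes,
    #     i.e. the forward information bits f were non-zero.
    if cluster.non_conforming:
        return True
    # (b) An adjacent cluster requests an edge insertion via the
    #     left/right adaptivity communication buffers.
    return any(marker == MR
               for buffers in cluster.adjacent_buffers
               for marker in buffers)

def run_conformity_traversals(clusters):
    """Iterate adaptivity traversals, skipping already conforming clusters."""
    while True:
        pending = [c for c in clusters if needs_adaptivity_traversal(c)]
        if not pending:
            break  # all clusters reached a conforming state
        for c in pending:  # each traversal may be executed as a parallel task
            c.run_adaptivity_traversal()  # updates non_conforming and buffers
```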


Results for short-term simulations:


Figure 5.30: Overview of scalability using different cluster generation strategies with adaptive conforming cluster skipping enabled. The legend is given in the format “scheduling strategy / cluster generation” [SWB13a].

We conducted several runs for the SWE benchmark scenario from Sec. 5.7. The results are given in Fig. 5.30, with the linear scalability shown for the implementation without adaptive conforming cluster skipping. In compliance with the results for the non-skipping version, the range-based cluster generation suffers from additional overhead due to more frequent cluster generation operations. However, considering the pure task-flooding with a threshold-based cluster generation, we achieve a higher performance compared to the basic implementation due to the algorithmic cluster-based optimizations.

Detailed statistics on the skipping algorithm:


Figure 5.31: Comparing the run time of skipping and non-skipping of conforming clusters. Here, we distinguish between the cluster generation time (Cluster), which is negligibly small, the time for adaptivity traversals (Adaptivity) and the time to run the computations for the time step (Simulation) [SWB13a].

We further analyze the robustness of the adaptive cluster skipping with detailed statistics on the run time, separated into the three major phases of a single simulation time step: (a) generating clusters, (b) adaptivity traversals and (c) time step traversals. The results are given in Fig. 5.31 for different numbers of threads executed on the same benchmark scenario. For all tested numbers of threads, the skipping algorithm yielded a robust performance improvement. The time spent for cluster generation is negligibly small. We account for this by the decreased time step size required by the higher-order spatial discretization: it leads to fewer changes in the grid per time step and therefore fewer executions of the cluster generation. The presented adaptive cluster skipping also leads to robust performance improvements for distributed-memory simulations, see Sec. 5.12.1.

5.8.3 Improved memory consumption with RLE meta information

Our run-length encoding provides an optimization for the block-wise data communication with stack-based communication. This section highlights the potential memory savings achieved by using our RLE meta information.

Improved memory consumption results based on simulation

With statistics gathered from executed simulations, we compare the number of RLE-stored meta information entries to the number of entries required if the communication primitives were stored separately [SWB13b].

Here, we assume that the meta information on vertices associated with an RLE edge communication entry is not stored explicitly, but inferred from the per-hyperface stored edge meta information. To give an example, we assume an RLE encoding ((A,2),(B,0),(C,3)) to be stored instead of the explicit list ((A,e),(A,e),(B,v),(C,e),(C,e),(C,e)), see Sec. 5.2.3. We then compute the ratio of the number of non-RLE encoded entries Q_N := |((A,e),(A,e),(B,v),(C,e),(C,e),(C,e))| to the number of entries required for our RLE scheme Q_R := |((A,2),(B,0),(C,3))|. This ratio Q := Q_N / Q_R then yields the factor of memory saved with our RLE meta information compared to storing the meta-communication information separately for each communication hyperface.
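As a small worked example under the counting convention above (a run length of 0 encodes a single vertex-only adjacency), the ratio Q can be computed as follows; the Python snippet is purely illustrative:

```python
# RLE meta information: (adjacent cluster, number of shared edges);
# a run length of 0 encodes a vertex-only adjacency.
rle = [("A", 2), ("B", 0), ("C", 3)]

# Non-RLE storage keeps one entry per shared edge and one entry per
# vertex-only adjacency, e.g. ((A,e),(A,e),(B,v),(C,e),(C,e),(C,e)).
q_n = sum(max(run_length, 1) for _, run_length in rle)  # 6 entries
q_r = len(rle)                                          # 3 RLE entries

print("Q =", q_n / q_r)  # Q = 2.0: factor saved by the RLE scheme
```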

We conducted empirical tests with hyperbolic simulations based on (a) the Euler and (b) the shallow water equations. Here, we used edge-based communication for the simulation and node-based communication for a visualization in each time step. The simulations are started with an initial refinement depth of d = 6 and up to a = 14 additional refinement levels. Both simulations are initialized with the same radial dam break, using the density as the height for the Euler simulation, and we simulate 100 seconds. This resulted in the execution statistics given in the table below:

              time steps   min. number of cells   max. number of cells
Euler                 31                  34593                  40228
Shallow water        733                  34593                 178167


Figure 5.32: Ratio of entries for RLE meta information to non-RLE meta information for different cluster sizes, for the simulations based on the shallow water and Euler equations. The error bars show the minimum and maximum savings of RLE meta information over the entire simulation time [SWB13b].

We plot the ratio Q for the simulation scenarios above in Fig. 5.32. Larger clusters lead to an increased ratio due to more shared hyperfaces per cluster and hence more entries to be compressed with our RLE meta information. The min/max bounds for the largest tested cluster size are not as large for the Euler simulation as for the shallow water equations. We account for this by two statistical effects: first, the shorter run time leads to fewer values involved in the min/max computation. Second, the smaller number of time steps for the Euler simulation also leads to fewer cells generated on average, with a maximum of 40228 cells over the entire simulation (see the table above). This results in a low average number of clusters during the entire simulation, almost no dynamic cluster generation, and thus fewer values involved in the min/max error range. For typical cluster sizes of more than 4000 cells, the memory requirements for the meta information are reduced by a factor of 9.

RLE vs. per-cell stored meta information

For simulations executed on unstructured grids, the meta communication information is typically stored for each cell. This is especially memory consuming for the meta information on vertices, since a vertex is shared by several adjacent cells whereas an edge is shared by only a single adjacent cell. However, the overhead of per-cell meta information compared to our cluster-shared-hyperface RLE encoding depends on the number of degrees of freedom stored in each cell, which we further analyze with a model [SWB13b]. Here, we denote the payload (DoF) per cell by W and assume a regular refinement depth d, yielding 2^d cells per cluster. We also assume that each non-RLE meta information entry takes the same amount of memory as a single DoF. Furthermore, we neglect the requirement of storing the cell orientation, since this can be bit-compressed and stored in a cell requiring less storage than a single DoF. Such a cell orientation can be required when accessing an adjacent cell.

With the grid generated by the bipartitioning Sierpiński SFC and based on our assumption of a regular grid, the required memory for the meta information on the adjacent cells of each triangle cell and the workload for each cell are given by

\[
S \;:=\; \underbrace{3}_{\text{edge adjacency information}} \;+\; \underbrace{2 \cdot 5 + 1}_{\text{vertex adjacency information}} \;+\; W
\]
with 3 edge-adjacent cells and 2 ⋅ 5 + 1 entries of meta information for the additional vertex-adjacent cells.

Storing the meta information only for the cluster boundaries yields R := 2 ⋅ (3 + 2 ⋅ 5 + 1) RLE entries to store the adjacency information, with the factor 2 accounting for the two-element tuple of each entry.

The ratio of the number of cells times the per-cell memory requirements to the RLE-encoded memory requirements then yields the memory saved for each cluster:

\[
\frac{S \cdot 2^d}{R + W \cdot 2^d} \;=\; \frac{(3 + 2 \cdot 5 + 1 + W) \cdot 2^d}{R + W \cdot 2^d} \;\xrightarrow[\text{L.H.}]{2^d \to \infty}\; \frac{3 + 2 \cdot 5 + 1 + W}{W} \;=\; \frac{14}{W} + 1
\]
We get an upper bound of 14/W + 1 which only depends on the per-cell payload W.
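The following Python snippet, given purely for illustration, evaluates this ratio for a few cluster sizes and per-cell payloads and compares it with the upper bound 14/W + 1:

```python
S_ADJ = 3 + 2 * 5 + 1   # per-cell adjacency entries (edges + vertices)
R = 2 * S_ADJ           # constant number of RLE entries per cluster

def memory_saving(w, d):
    """Ratio of per-cell stored meta information to RLE meta information."""
    cells = 2 ** d      # regular refinement depth d
    return (S_ADJ + w) * cells / (R + w * cells)

for w in (1, 4, 32):    # per-cell payload (DoF); sample values
    ratios = [round(memory_saving(w, d), 2) for d in (4, 8, 12, 16)]
    print(f"W={w:2d}  ratios={ratios}  upper bound={14 / w + 1:.2f}")
```

For W = 4 the ratio approaches 4.5, and for W = 32 it stays below 1.5, consistent with the discussion of Fig. 5.33 below.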


Figure 5.33: Ratio of saved memory by using our RLE encoding compared to per-cell stored adjacency information [SWB13b].

We plot this ratio for different cluster sizes and DoF per cell in Fig. 5.33. First, the plot shows that less memory is saved for smaller cluster sizes. We account for this by the fixed amount of RLE meta information: clusters with fewer cells require the same number of RLE entries as clusters with more cells, hence the benefit of the ratio of cells to RLE encoding shrinks. Second, for larger cluster sizes, the plot shows the potential of memory savings for simulations with a small number of DoF per cell. According to the simplified model, this can reduce the memory consumption by a factor of up to 4.5. Third, with a larger number of DoF per cell, as is the case for higher-order simulations or when storing regular sub-grids in each cell, the benefit of the RLE meta information for our stack- and stream-based simulations drops below 1.5 for more than 32 DoF per cell. However, other advantages of the RLE-based communication remain, e.g. the block-wise communication and the cluster-based data migration for distributed-memory systems (Section 5.10.3).