Cluster-based optimization

These html pages are based on the PhD thesis "Cluster-Based Parallelization of Simulations on Dynamically Adaptive Grids and Dynamic Resource Management" by Martin Schreiber.
There is also more information and a PDF version available.

[next] [prev] [prev-tail] [tail] [up]

5.8 Cluster-based optimization

We continue with an overview of possible optimizations:

With clustering based on a replicated data scheme, reduce operations are required in order to synchronize the replicated data. This synchronization can be achieved via duplicated reduce operations executed on each cluster or a single reduce operation per shared hyperfaces by considering the SFC order of the clusters. Details on this optimization are given in 5.8.1.
We expect performance improvements by cluster-based local-time stepping (see Section 5.6). With communication interfaces (sequentially stored and duplicated) similar to dynamic adaptive block-structured grids in the Peano framework [UWKA13], this would yield a similar algorithm, and thus we did not focus on such an implementation.
Considering iterative solvers, we can skip computations on clusters. This allows implementation of the local relaxation method [Rüd93] with multiple clusters per compute unit.
With grid traversals only required for a particular spatial area, traversals on clusters which are out of this area can be directly skipped during traversals. The bounding triangles for our tree-split based clusters is directly available in the meta information which is stored for each cluster. Thus, optimizations such as software-controlled frustum culling [GBW90,LP01] are possible. We optimized the sampling of simulation data at particular points (e.g. buoy data for Tsunami simulations or values for analytical benchmarks) by skipping clusters which do not cover the sampling point, see e.g. Section 6.3 for an application.
One of the main building blocks of this framework is a conforming grid within each cluster. The consistency traversals for a conforming grid generation are typically executed by traversing the entire grid. However, for clusters already in a conforming grid state and without any edge creation requests from adjacent clusters, no additional adaptivity traversals would be required. More information on this optimization is given in Section 5.8.2.
With the parallelization typically getting stuck during the output of simulation data, we can use a threaded parallelization model and a dedicated writer task or thread. This task/thread stores the output data to persistent storage whereas the other threads can continue running computations. These optimizations are presented in the Section 6.4.1.

5.8.1 Single reduce operation for replicated data

Using the replicated communication scheme (see Section 5.2), there are basically two ways to reduce the replicated data:

The first one is accessing the replicated data of the adjacent cluster in a read-only manner and executing the reduce operation twice for each cluster. This strictly follows a distributed-memory parallelization with the replicated data sent to the adjacent cluster, followed by a reduce operation.
We present an alternative approach by executing the reduce operation only once for the replicated data.
First, we have to make a decision which cluster should compute the reduction. We can determine such a cluster which is responsible for computing the reduction by considering the SFC order of the first cell in each cluster or the order of the unique ids assigned to each cluster since they also follow the SFC order (see Section 5.3.4). In our development, we use the cluster’s unique ids to determine the “owner” of the reduction operation.
Second, the owner computes the reduce operation and stores the reduced data which is to be processed by the adjacent cluster to the communication buffer. Thus, this range of the communication buffer does not contain anymore the data which is read for the reduction, but the already reduced data. An additional synchronization is required to the next step, which reads the reduced data to avoid race conditions.
Third, the “non-owner” can fetch the data from the communication buffer which is already reduced e.g. by a flux computation and the data can be processed avoiding duplicated reduce operations.

We conducted two experiments based on a shallow water simulation with a radial dam break, 18 initial refinement levels and up to 8 additional levels of adaptivity. Since we assume that the benefits of avoiding duplicated reduce operations depend on the costs of the flux solvers, we run two different benchmarks - each one with a different flux solver: the relatively lightweight Rusanov solver and the heavyweight augmented Riemann solver from the GeoClaw package [Geo08].

The saved computations also depend on the number of shared cluster interfaces, hence we executed benchmarks parametrized with different split thresholds. The results are based on a single-core execution and are given in Fig. 5.28 and they show, that possible performance improvements depend on the utilized flux solver.

pict

Figure 5.28: Performance speed up and speed down if avoiding duplicated reduction of replicated data.

1.: For the lightweight flux solver, avoiding reduce operations results in a performance decrease for smaller clusters. An explanation for this effect is given by the threading overheads: since we require additional synchronization operations, there is an additional overhead introduced by executing additional tasks. For the Rusanov flux solver, the synchronization overhead is higher than the benefits.
2.: Using heavyweight solvers, we see a speedup for optimization and assume that the computational intensity of the solver outperforms the tasking overheads for the additional synchronization.

To verify these assumptions, we develop a simplified model based on the following quantities:

d: We relate the number of cells in the cluster to the subtree’s refinement depth d.
o: This accounts for the tasking overhead costs per cluster, e.g. the computations to dequeue and start a task.
e: Computational time per edge.
c: Computational time per cell.
b: The number of cluster boundaries which we approximate to linearly depend on d. We approximate this with 2d.

Since the computational speedup can also be computed for each cluster, our model only considers the computational costs for a single cluster. These costs for a cluster with a single reduce operation are then given by

C (single)(o,e,c,d) := o + 2d ⋅c + 1.5-⋅2d ⋅e Additional◟ s◝y◜n◞ch. overhead Comput◟a◝ti◜o◞ns for cells Compu◟tat◝io◜ns fo◞r edges

whereas the costs with two reduce operations per shared hyperface are given by

d d C (double)(o,e,c,d) := 2◟◝◜⋅c◞ + 1◟.5-⋅◝2◜-⋅e◞ + 2◟d◝◜⋅e◞ . Computations for cells Computations for edges Additional comp. for edges

pict

Figure 5.29: Performance speed up and speed down if avoiding duplicated reduction of replicated data.

We compute the speedup using the single evaluation with C(double)-C(single)
C(double) and use rule-of-thumb values o = 1000, e.g. 1000 cycles to dequeue a task from the worker’s queue and c = 100 as the computational workload per cell. The speedup graph for different refinement depths and thus cluster sizes is given in Fig. 5.29 for different computational costs for edges.

The shapes of our experimentally determined results in Fig. 5.28 match to our model. We find the agreement, that in case of a low computational cost involved in each edge, there’s no benefit in using a single reduce operation. For computational intensive edge reductions as it is the case for the augmented Riemann solver, using a single reduce operation can be beneficial. However, as soon as the number of cells in each cluster is getting higher, the benefits tend towards zero.

We conclude, that for small cluster sizes, avoiding duplicated reduction operations can be beneficial for small cluster, e.g. required for strong-scaling problems and the algorithm presented in the next section.

5.8.2 Skipping of traversals on clusters with a conforming state

So far, we only considered adaptivity conformity traversals being executed by traversing the entire domain grid. For parallelization, we decomposed our domain with clusters and replicated data scheme to execute traversals massively parallel also in a shared-memory environment. For adaptivity traversals aiming at creating a conforming grid, the cluster approach is similar to a red-black coloring approach with information on hanging nodes possibly not forwarded to adjacent clusters in the current adaptivity traversal. Thus more traversals can be required in case of parallelization.

However, a clustering approach offers a way to compensate these conformity traversals if there are more than one clusters per compute unit: adaptivity traversals can be skipped on clusters that (a) already have a conforming grid and (b) do not receive a request of an adjacent cluster to insert an edge (adaptivity marker M_R). Both requirements are discussed next:

(a): To determine whether the adaptivity states already lead to a conforming cluster, we tag each cluster with a “non-conforming” boolean which is set to false before any adaptivity traversals. During the adaptivity traversals, as soon as information on inserting an edge is not propagated with the current traversal, this boolean is set to true. Such information on non-propagated hanging nodes is directly given by the forward information bits f (see Section 4.10.3): If these bits are not equal to zero, adaptivity information still has to be propagated by an additional traveral.
(b): Testing for adaptivity request propagations from adjacent clusters is accomplished by checking the left and right adaptivity communication buffers stored at the adjacent cluster which store adaptivity markers M_R for corresponding requests. In case that all values are not set to M_R, no adaptivity edge-insertion requests are propagated from the adjacent cluster.

This algorithm is based on the sparse graph representation (see Fig. 5.5 of our clustered domain: using this graph representation, our algorithm can be represented by a Petri net [SWB13a].

Algorithm: Skipping of clusters with a conforming adaptivity state

Here, each cluster is a node and the directed arcs exist for each RLE entry. Tokens represent requests for additional conformity traversals. We then distinguish between local consistency traversal tokens and adjacent consistency tokens with the latter one forwarded to, or received from adjacent clusters.

Local consistency tokens:
These tokens are generated by local consistency traversals (see (a) above) and are injected to each cluster’s graph node in case of the cluster still not in a conforming grid state. These tokens are removed after each traversal, e.g. by a transition requiring one or more tokens and firing no token.
Adjacent consistency tokens:
Tokens are fired via the connectivity graph edges to an adjacent cluster in case that a refinement marker is stored on the corresponding RLE encoded communication data. In case that such an adjacent consistency token is received from an adjacent cluster, an additional conformity traversal is started.

______________________________________

Results for short-term simulations:

pict

Figure 5.30: Overview of scalability using different cluster generation strategies with adaptive conforming cluster skipping enabled. The legend is given in the format “scheduling strategy / cluster generation” [SWB13a].

We conducted several runs for the SWE benchmark scenario from Sec. 5.7. The results are given in Fig. 5.30 with the linear scalability shown for the implementation without adaptive conforming cluster skipping. In compliance with the results for the non-skipping version, the range-based cluster generation suffers of additional overheads due to more frequent cluster generation operations. However, considering the pure task-flooding with a threshold-based cluster generation, we get a higher performance compared to the basic implementation due to algorithmic cluster-based optimizations.

Detailed statistics on the skipping algorithm:

pict

Figure 5.31: Comparing run time of skipping and non-skipping of conforming cluster. Here, we distinguish between the cluster generation time (Cluster) which is negligibly small, time for adaptivity tarversals (Adaptivity) and time to run the computations for the time step (Simulation) [SWB13a].

We further analyze the robustness of the adaptive cluster skipping and detailed statistics on the run time separated into the three major phases of a single simulation time step: (a) generating clusters, (b) adaptivity traversals and (c) time step traversals. The results are given in Fig. 5.31 for a different number of threads executed on the same benchmark scenario. For all tested numbers of threads, the skipping algorithm yielded a robust performance improvement. The time spent for cluster generation is negligibly small. We account for that by the required decreased time step size due to the higher-order spatial discretization. This also leads to less changes in the grid and therefore less executions of cluster generation. The presented adaptive cluster skipping also leads to robust performance improvements on distributed memory simulations, see Sec. 5.12.1.

5.8.3 Improved memory consumption with RLE meta information

Our run-length encoding provides an optimization for the blockw-wise data communication for stack-based communications. This section highlight the potential of saved memory storage by using our RLE meta information.

Improved memory consumption results based on simulation

With statistics gathered for executed simulations, we compare the number of RLE-stored meta information to the number of elements required if we would store the communication primitive separately [SWB13b].

Here, we assume the meta information on vertices associated to one RLE edge communication entry is not stored, but inferred by the per-hyperface stored edge meta information. To give an example, we assume an RLE encoding ((A,2),(B,0),(C,3)) to be stored with ((A,e),(A,e),(B,v),(C,e),(C,e),(C,e)), see Sec. 5.2.3. We then compute the ratio between the quantity of entries required for our RLE scheme Q_R := |((A,2),(B,0),(C,3))| to the number of non-RLE encoded entries Q_N := |((A,e),(A,e),(B,v),(C,e),(C,e),(C,e))|. This ratio Q := QN-
QR then yields the factor of memory saved with our RLE meta information compared to storing the meta-communication information separately for the communication hyperfaces.

We conducted empirical tests with a hyperbolic simulation based on (a) the Euler and (b) the shallow water equations. Here, we used edge-based communication for the simulation and node-based communication for a visualization in each time step. The simulation is started with an initial refinement depth of d = 6 and up to a = 14 additional refinement levels. The simulations are initialized with the same radial dam break, using the density as the height for the Euler simulation, and we simulate 100 seconds. This resulted in execution statistics given in the table below:

	time steps	min. number of cells	max. number of cells

Euler	31	34593	40228

Shallow water	733	34593	178167

pict

Figure 5.32: Ratio of entries for RLE meta information to non-RLE meta information for different cluster sizes for the simulation based on the shallow water and Euler equation. The error bars show the minimum and maximum savings of RLE meta information over the entire simulation time [SWB13b].

We plot the ratio Q for our simulation scenario above in Fig. 5.32. Larger clusters lead to an increased ratio due to more shared hyperfaces per cluster and hence more entries to be compressed with our RLE meta information. The min/max bounds for the largest tested cluster size with the Euler simulation is not as large as for the shallow water equation. We account for that by two statistical effects: the first one is a shorter run time which leads to less possible values involved in the min/max computation. The second one is, that less time steps for the Euler simulation also lead to less cells generated in average with a maximum of 40228 cells over the entire simulation (see the table above). This leads to only a low average number of clusters during the entire simulation, almost no dynamic cluster generation in the simulation and thus less values involved into the min/max error range. For typical cluster sizes of more than 4000 cells, the memory requirements for the meta information are reduced by factor of 9.

RLE vs. per-cell stored meta information

For simulations executed on unstructured grids, the meta communication information is typically stored for each cell. This is considered to be especially memory consuming for meta information on vertices due to more than a single adjacent cell compared only a single adjacent cell with edges. However, the ratio of the meta information overhead per-cell compared to our cluster-shared-hyperface RLE encoding depends on the number of degrees of freedom stored in each cell which we further analyze with a model [SWB13b]. Here, we model the payload (DoF) per cell with W and assume a regular refinement depth d, yielding 2^d cells per cluster. We also assume that the non-RLE meta information takes the same amount of memory as a single DoF. Furthermore, we neglect the requirement of storing the cell orientation since this can be bit-compressed and stored in a cell requiring less storage than a single DoF. Such a cell orientation can be required if accessing the adjacent cell.

pict With the grid generated by the bipartitioning Sierpiński SFC and based on our assumption of a regular grid, the required memory for the meta information on adjacent cells for each triangle cell and the workload for each cell are given by

S := 3 (3 edge adjacency information ) + 2 ⋅5+ 1 (vertex adjacency information ) + W

with 3 edge-adjacent cells and 2 ⋅ 5 + 1 required meta information on the per-vertex additional cells, see right handed figure.

Storing the meta information only for the cluster boundaries, this yields R := 2 ⋅ (3 + 2 ⋅ 5 + 1) RLE entries to store the adjacent information with the factor 2 accounting for the two-element tuple for each entry.

The ratio of the number of cells times the memory requirements to the RLE encoded memory requirements then yields the memory saved for each cluster:

--S-⋅2d--- (3+-2-⋅5-+-1+-W-)-⋅2d 2d→ ∞,L.H. 3-+-2⋅5-+-1+-W-- 14- R + W ⋅2d = R + W ⋅2d = W = W + 1

We get an upper bound of 14
W-

+ 1 which only depends on the per-cell payload W.

pict

Figure 5.33: Ratio of saved memory by using our RLE encoding compared to per-cell stored adjacency information [SWB13b].

We plotted this ratio for different cluster sizes and DoF per cell in Fig. 5.33. First, this plot shows that less memory is saved for smaller cluster sizes. We account for this by clusters with less cells requiring the same amount of RLE meta information compared to clusters with more cells. Hence, for smaller clusters, the benefit of the ratio of cells to RLE encoding is getting smaller. Second, for larger cluster sizes, the plot shows the potential of memory savings for simulations with a small number of DoF per cell. According to the simplified model, this can reduce the memory consumption up to a factor of 4.5. Third, with a larger number of DoF per cell, as it is the case for higher-order simulations or by storing regular sub-grids for each cell, the benefit of our stack- and stream-based simulations with RLE meta information is below 1.5 for more than 32 DoF per cell. Though, the advantages of the RLE-based communication are still given e.g. by the block-wise communication and cluster-based data migration for distributed-memory systems (Section 5.10.3).

[next] [prev] [prev-tail] [front] [up]