These html pages are based on the PhD thesis "Cluster-Based Parallelization of Simulations on Dynamically Adaptive Grids and Dynamic Resource Management" by Martin Schreiber.

5.10 Distributed-memory parallelization

Our distributed-memory parallelization builds on our cluster-based software design with its replicated data scheme, which we already use for the shared-memory parallelization. This design addresses the two major issues on distributed-memory systems: (a) efficient block-wise communication via the RLE meta information and (b) efficient data migration of entire clusters. Furthermore, our software design makes the extensions for distributed memory almost transparent to a framework user, since only minor additional requirements are induced, such as providing the migration of user-specified data.

Support for distributed memory can then be accomplished with the following extensions to the framework design (Fig. 5.10):

Simulation driver:
The simulation driver is responsible for controlling the overall simulation. This includes, e.g., determining the workload to be moved for load balancing and the global time-step size. We therefore extend the simulation driver with, e.g., a min-reduce operation over the time-step sizes of all ranks to compute the correct global time-step size.
Inter-cluster communication:
The data associated with hyperfaces shared with clusters on other ranks has to be sent and received correctly (Section 5.10.1).
Dynamic cluster generation:
The dynamic cluster generation has to be extended to distributed memory to support updating the RLE communication meta information for splits and joins of adjacent clusters (Section 5.10.2).
Cluster-based data migration:
Each cluster is extended with instructions for transferring its meta-, stack- and user-specified data to another rank. Also updating the RLE communication meta information has to be considered (Section 5.10.3).
Base triangulation:
Not all base triangulations remain valid, since our communication scheme for distributed memory relies on the SFC order of the communication data (Section 5.10.4).
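The min-reduce in the simulation driver (first item above) can be sketched without an MPI runtime as follows. This is a minimal illustration only: the list of per-rank time steps merely simulates the reduction in a single process, and all names are hypothetical; with mpi4py this would correspond to `comm.allreduce(local_dt, op=MPI.MIN)`.

```python
def global_timestep(local_dts):
    """Simulate a min-reduce over the per-rank local time-step sizes.

    Each rank derives a local (e.g. CFL-limited) time step; the correct
    global time step is the minimum over all ranks.
    """
    return min(local_dts)

# Hypothetical per-rank local time-step sizes:
local_dts = [0.04, 0.025, 0.031, 0.05]
dt_global = global_timestep(local_dts)  # every rank continues with 0.025
```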

Considering the components relevant from an algorithmic point of view, the distributed-memory parallelization is based on the last four items (inter-cluster communication, dynamic cluster generation, cluster-based data migration and base triangulation), which are discussed in detail in the following sections.

5.10.1 Intra- and inter-cluster communication

Regarding the cell communication, requirements on (a) intra- and (b) inter-cluster communication are discussed next:

Intra-cluster communication:
We exploit the property that the data exchange between cells is accomplished via the stack system. Communication via this stack system does not depend on any explicitly stored adjacency information. Our stack-based communication method is therefore invariant with respect to the memory location and the rank. Consequently, intra-cluster communication does not depend on where a cluster is stored.
Inter-cluster communication:
For inter-cluster communication, only a minor modification is required to support distributed memory: the RLE communication meta information about an adjacent cluster is extended by the rank with which the data has to be exchanged.
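The extension of an RLE meta-information entry can be sketched as a small record. The field names are hypothetical and only illustrate which information is stored; for shared memory the cluster id and run length suffice, while distributed memory additionally requires the adjacent rank.

```python
from dataclasses import dataclass

@dataclass
class RLEEntry:
    """One run-length-encoded communication entry (illustrative sketch).

    cluster_id:  unique id of the adjacent cluster
    run_length:  number of consecutive shared hyperfaces
    rank:        rank owning the adjacent cluster (distributed-memory extension)
    """
    cluster_id: int
    run_length: int
    rank: int

# Hypothetical entry: 3 shared hyperfaces with cluster 7, stored on rank 2.
entry = RLEEntry(cluster_id=7, run_length=3, rank=2)
```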

For the inter-cluster data exchange, we use non-blocking send operations to transfer the replicated data to the rank of the adjacent cluster. With our RLE meta information, this data transfer can be accomplished directly block-wise, i.e. without gathering smaller chunks of data before transferring them.

We further use MPI send/recv tags to label each communication, with the tag encoding whether the communicated data close to the inter-cluster shared hyperfaces is stored to the communication buffers in clockwise or counter-clockwise SFC order. This is required, e.g., for base triangulations consisting of only two triangles with periodic boundary conditions: the tags avoid data stored for clockwise edge directions being read for edges with counter-clockwise direction and vice versa.

Depending on the method of data exchange on distributed-memory systems, our approach can further demand a particular order and tagging. To receive the data blocks in the correct order, the send operations follow the SFC order of the clusters, whereas the receive operations are executed in reversed SFC order. Also the RLE meta-information entries have to be iterated in reversed order to account for the opposite traversal direction of the clusters on the other rank (see Section 5.2.5 for information on reversing communicated data). Reversing the receive operations in this way assures the correct order.
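The reversed receive order can be illustrated with a small sketch. All names and the tuple representation are hypothetical: the adjacent rank sends edge data in its own SFC order, and the local rank, visiting the same shared edges in the opposite direction, consumes the received blocks back to front so that each value ends up at the matching edge.

```python
def pair_exchanged_edges(sent, local_edge_order):
    """Pair data sent in the adjacent rank's SFC order with the local
    traversal, which visits the shared edges in reversed order.

    sent:             list of (edge_id, value) in the sender's SFC order
    local_edge_order: edge ids in the local traversal order
    """
    paired = {}
    for (edge_id, value), local_id in zip(reversed(sent), local_edge_order):
        # The two traversals mirror each other along the shared boundary.
        assert edge_id == local_id, "traversal orders must mirror each other"
        paired[local_id] = value
    return paired

# Hypothetical shared edges e0..e2 along a rank boundary:
sent = [("e0", 1.0), ("e1", 2.0), ("e2", 3.0)]   # sender's SFC order
paired = pair_exchanged_edges(sent, ["e2", "e1", "e0"])
```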

5.10.2 Dynamic cluster generation

For load-balancing reasons, it can be advantageous to split a cluster if its one-dimensional SFC interval representation is to be assigned to several cores.

The cluster-generation approach considered in this work is based on the rank-local number of cells only. This makes the cluster generation independent of the global load distribution. An extension with a global parallel prefix sum over the number of cells [HKR+12], generating clusters in a globally aware manner, is not further considered in this work.

After splitting and joining clusters, we also have to reconstruct the communication meta information to account for possible splits and joins of adjacent clusters. With our shared-memory approach (see Sec. 5.5.3), updating local RLE meta information associated with a cluster stored on another rank is no longer possible, since direct access to the meta information of the adjacent cluster is missing. Therefore, the adjacent cluster is responsible for deriving and sending the required split/join information. The local cluster then receives this information and uses it to update its RLE meta information, accounting for adjacent split/join operations.

5.10.3 Cluster-based load balancing

With dynamically adaptive grids, refinement and coarsening result in load imbalances across ranks. Migrating workload to another rank is one of the typical countermeasures. With our cluster-based data migration, we present a highly efficient method for migrating a set of clusters. This efficiency results from three major components of our software design:

The migration algorithm of our cluster then consists of the following steps:

Destination labeling:
We label each cluster with the rank to which it has to be migrated.
Prospective update of communication meta information:
For each RLE meta-information entry that represents communication with a cluster stored on another rank, we send the destination rank to that adjacent rank; otherwise we send ϵ to represent that no migration takes place. The local RLE meta information about the adjacent cluster is then updated in case the adjacent cluster is migrated to another rank. This assures that each adjacent cluster can update its meta information to the new destination of the migrated cluster.

For efficiency reasons, we merged this step with the update of the RLE information performed while forwarding the information for dynamic clustering (Section 5.10.2).

Cluster migration:
The cluster data (stacks, meta and user information, …) is then migrated to the adjacent rank, see Section 5.10.3. This migration already includes the updated communication meta information from the previous step.
Synchronize the cluster-local RLE information:
We distinguish between received and sent clusters:

After a rank has received all clusters, the pointers to adjacent clusters stored on the same rank have to be recovered, e.g. for accessing communication meta data. Since each RLE meta-information entry also keeps the unique id of the adjacent cluster, we can find these clusters efficiently in the cluster set, which is based on a binary tree structure. The pointers from the adjacent cluster to the currently processed one can then be corrected directly by writing to the RLE meta information of the adjacent cluster.

We emphasize that these write operations to adjacent clusters violate our push and pull concept (no write access to adjacent clusters). However, we consider this the most efficient way to update the communication meta information for clusters stored in the same memory context.

For clusters migrated away from the considered rank, the RLE meta information of the adjacent clusters remaining on this rank has to be updated. These entries are set to the new rank to which the cluster was migrated. This new rank is available via the destination-rank label, see the destination labeling step above.
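The prospective update of the communication meta information described above can be sketched as follows. The dictionary representation and all names are hypothetical, and ϵ is modeled as Python's `None`: each adjacent rank announces either the destination rank of a migrating cluster or ϵ, and the local RLE entries are updated only for announced migrations.

```python
EPSILON = None  # stands for the ϵ message, i.e. "no migration"

def prospective_update(rle_ranks, announced):
    """Update the rank stored in remote RLE entries.

    rle_ranks: dict cluster_id -> rank currently stored for that adjacent cluster
    announced: dict cluster_id -> destination rank, or EPSILON if the
               adjacent cluster is not migrated
    """
    for cluster_id, destination in announced.items():
        if destination is not EPSILON and cluster_id in rle_ranks:
            rle_ranks[cluster_id] = destination
    return rle_ranks

# Cluster "C" stays on rank 1; cluster "D" announces migration to rank 5:
rle_ranks = prospective_update({"C": 1, "D": 2}, {"C": EPSILON, "D": 5})
```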

The information where to send the clusters is derived from a local and an MPI parallel prefix sum over the number of cells in each cluster. Each rank can then determine its current range of cells within the global number of cells of the entire simulation. The clusters are tagged with a rank such that the load balance is improved. Such load balancing can, e.g., be achieved by sending clusters only to the next or previous rank. This is advantageous, since only the next and previous rank have to test for a cluster migration from an adjacent node. For severe load imbalances, however, such a migration scheme is not sufficient, since only an iterative data migration over several steps would resolve the imbalance. Hence, we can instead tag each cluster to be sent directly to the rank which should own it for improved load balancing. Here, the rank is computed by

rank := ⌊(R_i + W_i/2) / W_avg⌋

with R_i the global start id of the first cell in the cluster, W_i the workload (number of cells) in the cluster and W_avg the average number of cells per rank. This can be interpreted as an extension of the range information presented in Sec. 5.6 to distributed-memory systems. Then, an all-to-all communication is used to inform ranks that they receive one or more clusters from a rank. Further details on such a load balancing can, e.g., be found in [HKR+12]. Finally, we emphasize that the data migration of clusters still has to conserve the SFC order of the clusters across all MPI ranks.
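The rank computation can be stated directly in code. This is a minimal sketch of the formula above; the example values below are hypothetical (100 cells distributed over 4 ranks, so W_avg = 25).

```python
import math

def destination_rank(R_i, W_i, W_avg):
    """Rank whose cell range contains the midpoint of the cluster's
    one-dimensional SFC interval: floor((R_i + W_i/2) / W_avg)."""
    return math.floor((R_i + W_i / 2) / W_avg)

# Hypothetical setup: 100 cells, 4 ranks, hence W_avg = 25 cells per rank.
# A cluster starting at global cell id 60 with 10 cells has its midpoint
# at cell 65 and is therefore tagged for rank 2.
rank = destination_rank(R_i=60, W_i=10, W_avg=25)
```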

Since our cluster migration is based on setting the destination rank followed by the migration itself, this allows implementing a generic load-balancing interface [DHB+00] and, thus, also different well-studied load-balancing and migration strategies, see e.g. [ZK05, Cyb89, Hor93].

5.10.4 Distributed base triangulation


Figure 5.35: Possible domain triangulation for simulations with a shared-memory parallelization, forbidden for distributed-memory runs

With our communication scheme relying on the correct SFC order of the underlying grid cells, we may no longer generate domain triangulations such as the one given in Fig. 5.35. In general, base triangulations whose SFC traversals run along shared edges in the same direction (see e.g. [NCT09] for a cubed-sphere SFC traversal with the Hilbert SFC) result in inconsistent communication schemes.

A different view of a correct communication order for base triangulations on distributed-memory systems is the possibility of embedding our grid into a (regularly) recursively structured grid based on the Sierpiński SFC. This consequently leads to a space-forest assembled from the nodes on a particular level of a regularly refined spacetree. Examples of invalid and valid base triangulations with the corresponding embedding into a regular refinement are given in Fig. 5.36.


Figure 5.36: Different base triangulations. Left image: Invalid base triangulation with the middle two triangles not embeddable to the regularly refined grid. Right image: valid base triangulation.

The initial association of base triangles to ranks aims for balanced workloads. By embedding our base triangles into the SFC-generated grid, we can again use the one-dimensional representation to assign clusters to ranks. Each rank then initializes the clusters assigned to it. Regarding the cluster tree, each rank removes all leaf nodes not assigned to it, as well as all inner nodes left without any child nodes.
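The pruning of the cluster tree described above can be sketched as follows. The tree representation is hypothetical (leaves are cluster ids, inner nodes are lists of children): unassigned leaves are removed, and inner nodes that lose all children are removed as well, while inner nodes with at least one remaining child are kept.

```python
def prune(tree, assigned):
    """Prune a cluster tree down to the clusters assigned to this rank.

    tree:     a cluster id (leaf) or a list of child subtrees (inner node)
    assigned: set of cluster ids assigned to the current rank
    Returns the pruned subtree, or None if nothing remains.
    """
    if isinstance(tree, list):
        children = [c for c in (prune(child, assigned) for child in tree)
                    if c is not None]
        # Inner nodes without any remaining child are removed.
        return children if children else None
    return tree if tree in assigned else None

# Four base clusters a..d; this rank is assigned clusters a and d:
local_tree = prune([["a", "b"], ["c", "d"]], {"a", "d"})
```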

5.10.5 Similarities with parallelization of block-adaptive grids

We continue by highlighting similarities between our clustering approach and the parallelization of dynamically adaptive block-adaptive grids.

Hence, a cluster-based parallelization approach allows similar optimizations such as hybrid parallelization, latency hiding, etc., some of which are further investigated in this work.