These html pages are based on the PhD thesis "Cluster-Based Parallelization of Simulations on Dynamically Adaptive Grids and Dynamic Resource Management" by Martin Schreiber.

5.2 Inter-partition communication and dynamic meta information

With partitions generated from SFC intervals, we present communication patterns that generate and synchronize the data on shared interfaces of partitions with a replicated data scheme. First of all, a replicated data scheme requires independent communication buffers. To account for this, we use additional stack-based communication buffers for each partition. We continue with a static number of partitions and refer to Section 5.5 for dynamic partition generation.

5.2.1 Grid traversals with replicated data layout

For our Sierpiński stack- and stream-based traversals, we first describe the replicated data layout in an abstract way. It can be implemented with a forward and a backward traversal in the following way:

(a)
Forward traversal:
During the forward grid traversal of partition Pi, the edge types of all shared interfaces of the partition are set to new. With our grid traversal based on recursion and inherited edge types, the edge types for the shared interfaces can be directly set to new (and, for the backward traversal, to old) at the sub-tree’s root node.

By executing an SFC-based grid traversal, communication data is written to the corresponding edge- or vertex-communication stacks. Since these output buffers are replicated, this access is free of race conditions. Afterwards, the output stacks are filled with the communication data of ∂Pi.

(b)
Synchronization:
Working on replicated data, we execute a reduce operation on the data stored at shared interfaces to synchronize the replicated data on the communication buffers. Information on the placement of the shared-interface data of ∂Pi in the memory buffers is provided in Section 5.2.3.
(c)
Backward traversal:
Finally, a backward SFC traversal reads the data from the communication stacks by setting the edge types of shared interfaces to old. This results in reading the communication data on which the reduce operation was executed.

A parallelization with a replicated data scheme allows the forward traversals to be executed on all partitions in parallel and in arbitrary order. The same holds for the synchronization operations and the backward traversals. Therefore, the only required synchronization points are between the forward, reduce and backward phases. So far, this approach is similar to the one taken in [Vig12].
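To illustrate the three phases, the following minimal Python sketch runs the forward and backward traversals of two toy partitions concurrently and reduces the single shared interface between them. The partition layout, the data values and the additive reduce operation are purely illustrative assumptions, not the thesis implementation:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy setup: two partitions share one interface; each keeps its own
# (replicated) communication stack for the shared interface data.
partitions = {
    "P0": {"shared_value": 3.0, "comm_stack": []},
    "P1": {"shared_value": 5.0, "comm_stack": []},
}

def forward(p):
    # (a) forward traversal: push the shared-interface data to the
    # partition's private communication stack (race-free: replicated buffer)
    p["comm_stack"].append(p["shared_value"])

def backward(p):
    # (c) backward traversal: pop the reduced data back into the partition
    p["shared_value"] = p["comm_stack"].pop()

with ThreadPoolExecutor() as pool:          # phase (a): parallel, any order
    list(pool.map(forward, partitions.values()))

# (b) synchronization: reduce the replicated copies of the shared interface
reduced = partitions["P0"]["comm_stack"][0] + partitions["P1"]["comm_stack"][0]
for p in partitions.values():
    p["comm_stack"][0] = reduced

with ThreadPoolExecutor() as pool:          # phase (c): parallel, any order
    list(pool.map(backward, partitions.values()))

# both partitions now agree on the reduced interface value (3 + 5 = 8)
assert all(p["shared_value"] == 8.0 for p in partitions.values())
```

The only required ordering is between the three phases; within each phase, the partitions can be processed in any order.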

5.2.2 Properties of SFC-based inter-partition communication

We next discuss important properties of the stack- and stream-based communication with the Sierpiński SFC.

Lemma 5.2.1 (Order of replicated data): After the first traversal, the elements on the communication stack are ordered by the SFC indices of the cells which created them.

Proof:
This lemma follows directly from Theorem 4.7.2 on page 166, since the elements on the communication stacks are kept in the correct order also during the grid traversal. □

With the communication data ordered by SFC cell indices, we obtain additional properties regarding the cardinality and uniqueness of the data exchange which are mandatory for our development:

Theorem 5.2.2 (Unique adjacent partition): All communication data for an adjacent partition Pk are stored consecutively on the communication stack. This induces a unique adjacency of partitions.

Proof:
The proof is given by reductio ad absurdum using the communication element order from Lemma 5.2.1. Assume two non-empty communication data sets S1 and S3 on the communication stack shared with an adjacent partition Pa, and a set S2 associated to a partition Pb with a ≠ b, such that S1, S2 and S3 are stored consecutively on the communication stack. Further, let the communication data consist of SFC-ordered cell indices. Then, partition Pb accesses elements which are stored between the communication elements stored for partition Pa. However, this leads to a contradiction with Lemma 5.2.1, page 207: with all cell indices from Pa within a particular SFC interval, the cell indices from Pb would have to lie within the interval of Pa. With our grid partitioning approach (see Section 5.1.1), this is not possible, since consecutive intervals (without gaps) are assigned exclusively to each partition. □

5.2.3 Meta information for communication

So far, we assumed it to be known which blocks of data stored on the communication stacks are associated with which partition. We refer to this knowledge as communication meta information.

We can store this information per partition instead of per cell due to two properties of the SFC stack-based communication:

1.
The adjacency information per cell is not required for inner-partition hyperfaces due to stack-based communication.
2.
We can store and manage data on shared interfaces Ii,j efficiently with a run-length encoding per partition, which we present next:
Run-length encoded adjacency information:

We continue with a description of the meta information for communication and how it is managed. Searching the adjacency data for each communication element can be time-consuming if we iterate over the adjacency information stored for all shared hyperfaces or stored per cell. For node-based communication, with each node adjacent to up to 8 cells, managing adjacency information can also be very memory-demanding.

We present a solution based on both previously derived theorems:

[ERRATA / COMMENT: We developed this proof to ensure that the RLE connectivity information for our clustering approach (developed in 2011) works properly in all cases: for data exchange, for updating the RLE connectivity information via edge-split/merge adaptivity markers, for inferring the new connectivity information during cluster generation, etc., and finally to prove that the possible extension to SFC cuts really works (see http://www5.in.tum.de/~schreibm/private/thuburn/extracted/dynamic_adaptive_grids/2012_06_13_lst_treffen.pdf). In the context of solely focusing on optimizing communication for MPI send/recv, a similar proof can also be found in "Space-filling curves", Michael Bader, Sec. 10.6.2. ]

This allows the introduction of a run-length-encoded (RLE) representation of the adjacency information.



Figure 5.4: Example of RLE edge communication meta information. The domain is partitioned via subtrees of the spacetree into partitions A-F. Here, we discuss the RLE edge communication meta information for partition B. Regarding the left edge communication stack, three edge communication data elements (a, b, c) are stored there after the first forward traversal. These shared edges are adjacent to partition A and encoded with the RLE entry (A,3). For the right communication stack, and following the SFC-induced edge access of the forward traversal in partition B, the first adjacent partition is E via two edges, hence the RLE entry (E,2). This is followed by partition C with only one edge, encoded with the RLE entry (C,1).



Figure 5.5: An example of a sparse communication graph representing our RLE meta information for edges. Each graph edge represents one entry in the RLE meta information.

We first consider one-dimensional shared interfaces (edges) only and the information on communication edges for adjacent partitions A, B and C, given by 2, 4 and 3 shared edges, respectively. Without RLE, we can store the meta information as the tuple (A,A,B,B,B,B,C,C,C). Using our RLE, we represent m entries for edge communication referring to the same adjacent partition P with the tuple R := (P,m). We can compress the meta information above to ((A,2),(B,4),(C,3)). Figure 5.4 gives an example of such an RLE edge communication for a single partition and Figure 5.5 shows the underlying representation of the sparse edge communication graph.
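The compression step can be sketched as follows; `rle_compress` is a hypothetical helper name, and the single pass is sufficient because, by Theorem 5.2.2, all entries for one adjacent partition are stored consecutively:

```python
def rle_compress(meta):
    """Compress per-edge adjacency meta information (one adjacent-partition
    id per shared edge, in stack order) into (partition, run-length) tuples."""
    compressed = []
    for p in meta:
        if compressed and compressed[-1][0] == p:
            # extend the current run for the same adjacent partition
            compressed[-1] = (p, compressed[-1][1] + 1)
        else:
            # a new adjacent partition starts a new run
            compressed.append((p, 1))
    return compressed

print(rle_compress(["A", "A", "B", "B", "B", "B", "C", "C", "C"]))
# [('A', 2), ('B', 4), ('C', 3)]
```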

For the zero-dimensional hyperfaces (nodes), we can extend the RLE representation for the edges to directly account for nodes: we first require that no zero-length-encoded communication information for edges may be stored. This allows us to use RLE elements with m = 0 to describe a vertex shared with an adjacent partition which is not yet covered by the edge-based communication.



Figure 5.6: Sparse communication graph for edges and nodes. Left image: grid representation. Right image: communication graph for the given underlying grid. Solid graph edges represent communication via edges, whereas dashed graph edges represent communication via nodes of partitions not covered by the RLE edge communication meta information.

To give an example, we assume adjacent partitions A, B and C sharing 2 edges, 1 vertex and 3 edges, respectively. Storing the information on edges and vertices separately would result in the encoding (A,A,C,C,C) for the edge communication information and (A,A,A,B,C,C,C,C) for the vertex communication information. Unifying both representations with tuples and additional markers e and v, representing edges and vertices, yields ((A,e),(A,e),(B,v),(C,e),(C,e),(C,e)). Here, we assume that two consecutive and identical entries also account for a vertex. Using our RLE scheme, this meta information can be further compressed to ((A,2),(B,0),(C,3)). The underlying sparse communication graph is given in Fig. 5.6. This RLE meta information is stored separately for the left and right communication buffers.
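The unified edge/vertex compression can be sketched as follows, assuming hypothetical markers 'e' and 'v' as in the example above; a vertex-only neighbor contributes a run length of zero, while vertices on edge-adjacent partitions are implied by the edge runs themselves:

```python
def rle_compress_hyperfaces(meta):
    """Compress a mixed sequence of (partition, kind) entries, with kind
    'e' (shared edge) or 'v' (vertex-only neighbor), into RLE tuples.
    Vertex-only neighbors yield m = 0 entries."""
    compressed = []
    for partition, kind in meta:
        run = 1 if kind == "e" else 0   # edges count, lone vertices do not
        if compressed and compressed[-1][0] == partition:
            compressed[-1] = (partition, compressed[-1][1] + run)
        else:
            compressed.append((partition, run))
    return compressed

meta = [("A", "e"), ("A", "e"), ("B", "v"), ("C", "e"), ("C", "e"), ("C", "e")]
print(rle_compress_hyperfaces(meta))
# [('A', 2), ('B', 0), ('C', 3)]
```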

We summarize the main benefits of using such an RLE compared to meta information stored for each shared hyperface:

(a)
Less memory is required to store adjacency information.
(b)
We can use block-wise transfers for shared- and distributed-memory communication.
(c)
We can implement an efficient implicit management of RLE meta information for our adaptivity traversals (see e.g. Section 5.2.6).

When considering the communication of cell ids, the SFC traversal of the adjacent partition generates the data on the communication stacks in reversed (descending) order. This leads to the requirement of reversing the order of elements transferred block-wise.
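Locating such a block via the RLE meta information and reversing it can be sketched as follows; the helper names and the toy stack contents are assumptions for illustration:

```python
def block_offset(rle, neighbor):
    """Offset and length of the block stored for `neighbor` on a
    communication stack described by RLE entries (partition, run_length)."""
    offset = 0
    for partition, count in rle:
        if partition == neighbor:
            return offset, count
        offset += count
    raise KeyError(neighbor)

def fetch_block(neighbor_stack, neighbor_rle, my_id):
    """Read the block an adjacent partition stored for us; the neighbor's
    SFC traversal wrote it in reversed (descending) order, so reverse it."""
    offset, count = block_offset(neighbor_rle, my_id)
    return neighbor_stack[offset:offset + count][::-1]

# Toy example: partition E stored blocks for partitions D and B on its stack.
stack_E = [10, 11, 20, 21]            # D's block first, then B's block
rle_E = [("D", 2), ("B", 2)]
print(fetch_block(stack_E, rle_E, "B"))
# [21, 20]
```

Since each adjacent partition occupies exactly one consecutive block (Theorem 5.2.2), a single offset/length pair per RLE entry suffices for block-wise transfers.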

5.2.4 Vertices uniqueness problem

So far, we ignored the vertices of the cells touched first and last during the traversal of a partition. The data associated to such a vertex can be stored to either the left or the right communication stack. We consider two solutions for this uniqueness problem:

We decided to use a modification of the SFC traversal for the present work: with our recursive subtree traversal, we override the underlying SFC traversal grammar G at the subtree’s root node to force the placement of the first and last vertex on the left or right communication stack. We then use this uniqueness to derive the placement of the vertices on the communication stacks.

5.2.5 Exchanging communication data and additional stacks

So far, we know the amount of data to read and the adjacent partition to read it from. With our parallelization based on the replicated data scheme (see Sec. 5.1.2), we use separate buffers for the inter-partition shared hyperfaces. However, exchanging data with adjacent partitions using the same stacks for receiving data as for writing data would lead to race conditions. Therefore, we extend the stack system with buffers storing the exchanged data or, in combination with the communication buffer, the reduced data in case of a reduce operation. This additional exchange stack is then used by the backward traversal instead of the communication stack of the forward traversal. These additional exchange stacks lead to a duplication of each communication stack due to our replicated data scheme.
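The role of the exchange stacks can be sketched for a single pair of adjacent partitions; `synchronize_pair` and the additive reduce operation are illustrative assumptions, and the reversal accounts for the opposite SFC traversal order of the neighbor:

```python
def synchronize_pair(comm_a, comm_b, reduce_op):
    """Reduce the replicated interface data of two adjacent partitions into
    separate exchange stacks, leaving the communication stacks untouched
    (they are still needed as reduce input); the backward traversal then
    reads from the exchange stacks instead."""
    # each partition pairs its entries with the neighbor's reversed block
    exchange_a = [reduce_op(x, y) for x, y in zip(comm_a, reversed(comm_b))]
    exchange_b = [reduce_op(x, y) for x, y in zip(comm_b, reversed(comm_a))]
    return exchange_a, exchange_b

ex_a, ex_b = synchronize_pair([1, 2], [30, 20], lambda u, v: u + v)
print(ex_a, ex_b)
# [21, 32] [32, 21]
```

Both exchange stacks hold the same reduced values, each in the order expected by the owning partition's backward traversal.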

We further differentiate between shared- and distributed-memory data exchange:

5.2.6 Dynamic updating of run-length-encoded adjacency information

Communication to/from cells adjacent via hyperfaces can be accomplished with our communication stack system and the RLE adjacency information. Due to our dynamically adaptive grids, this adjacency information has to be updated appropriately whenever the grid changes. To avoid a reconstruction of the meta information, e.g. based on a recursive spacetree traversal, we use additional information stored on the communication stacks during our last adaptivity traversal.

Instead of running only the last backward adaptivity traversal to refine and coarsen cells, we also transfer additional information on inserted and removed edges via the edge communication stacks. To generate data for the partition boundary ∂Pi, we set the edge types of the partition boundaries to new and forward the following markers via edges:

After the grid traversal, the left and right edge communication stacks then store adaptivity markers for split (MR: inserting a vertex) and joined (2 × MC: removing a vertex) edges of the partition boundary ∂Pi. This allows us to update the RLE meta information of the modified grid based on these markers only. An example is given in Fig. 5.7.

Updating the left and right RLE meta information is then accomplished by iterating over the respective adaptivity communication stack to which the adaptivity markers were written. These markers describe the change of the RLE meta information due to the adaptivity step.

Algorithm: Updating communication meta information

-
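Under the assumption of one marker per old boundary edge, with hypothetical marker names K (edge kept), R (edge refined/split) and a pair of consecutive C markers for two joined edges, the update can be sketched as follows; the marker names are assumptions, and the reversal of the marker stack order (see Fig. 5.7) is omitted for clarity:

```python
def update_rle(rle, markers):
    """Update RLE edge-communication entries (partition, run_length)
    from adaptivity markers, one marker per edge of the old boundary:
    'K' keeps an edge, 'R' splits it into two (run length + 1),
    and a pair of 'C' markers merges two edges into one (run length - 1)."""
    it = iter(markers)
    updated = []
    for partition, count in rle:
        new_count = 0
        consumed = 0
        while consumed < count:
            m = next(it)
            if m == "K":
                new_count += 1      # edge unchanged
                consumed += 1
            elif m == "R":
                new_count += 2      # refined edge: one extra edge inserted
                consumed += 1
            elif m == "C":
                assert next(it) == "C"  # coarsening markers come in pairs
                new_count += 1      # two old edges merged into one
                consumed += 2
        updated.append((partition, new_count))
    return updated

print(update_rle([("A", 2), ("B", 2)], ["R", "K", "C", "C"]))
# [('A', 3), ('B', 1)]
```

Entries with run length 0 (vertex-only neighbors) consume no markers and are kept unchanged, so the vertex adjacency information survives the update transparently.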




Figure 5.7: Updating the communication meta information based on adaptivity refinement (MR = R) and coarsening markers (MC = C). Note that the adaptivity markers are stored in backward traversal direction. For a single refinement marker R, the corresponding RLE entry has to be incremented by 1 to account for the inserted edge. Reading two coarsening markers C from the adaptivity communication stack represents a single removed edge; therefore, the RLE entry is decremented by 1 [SBB12].

Updating the vertex communication data is transparent to this adaptivity process: adding or removing edges via updating the RLE also accounts for updating the vertex-based communication information due to three properties:

Special care has to be taken for partitions consisting of only a single cell, since a coarsening operation would lead to a partition consisting of only half a cell. Such coarsening operations must be deactivated by invalidating the coarsening state on the adaptivity state stack.

The presented communication schemes are also applicable to partitions generated by SFC cuts.