These html pages are based on the PhD thesis "Cluster-Based Parallelization of Simulations on Dynamically Adaptive Grids and Dynamic Resource Management" by Martin Schreiber.

There is also more information and a PDF version available.

There is also more information and a PDF version available.

For our clusters based on tree splits, we use tree-split and -join operations for the cluster generation.

To split a cluster, we require information on (a) the number of shared hyperfaces along the separating hyperfaces and (b) the association of persistent data to both child clusters.

We achieve inferring this information by stopping our grid traversal at particular positions and implicitly derive the required quantities on the number of hyperfaces and cells based on the communication stacks (meta communication information) and persistent data stack (simulation data). We refer to this information as split information.

For optimization reasons, we can further join the determination of split information with the last
backward adaptivity traversal and present the algorithm based on this backward traversal which is in
reversed direction, see Fig. 5.13. The algorithm is based on the spacetree nodes on the 2^{nd}
level relative to the cluster root tree node and follows the SFC in reversed direction with
(a,b,c,d). Here, we consider partitions (A,B,C) respectively set up by leaf nodes of subtrees
(a,b,c ∪ d).

The application of our algorithm is given in Fig. 5.13 with the left and right communication
stacks terminology used for the last adaptivity (backward) traversal and for grammars of
type even. We denote the stacks for the left and right adaptivity communication stack
with S_{left} and S_{right}, respectively. The cell stack is given by C, and we determine the
number of cells of the parent element by considering the elements stored on the cell stack:
c^{(parent)} := |C|.

The information on the split operation is stored in s_{{left,right}}^{(1)} for the first and in s_{{left,right}}^{(2)}
for the second child. We also require the number of edges shared by both split clusters in s^{(shared)}.
The number of cells associated to the first and second child partition is stored to c^{(1)} and
c^{(2)}.

- 1.
- After traversal of cells in subcluster A, the communication edges on the left stack for the
2
^{nd}subcluster are memorized with - 2.
- After traversal of cells in B, the shared edges are computed by
^{nd}subcluster on the right stack are saved in^{nd}cluster created by the split operation is inferred by - 3.
- Due to the adaptivity traversal, the information on the shared edges s
^{(shared)}only represents the quantity of shared hyperfaces before refining or coarsening the grid based on the adaptivity states. Therefore, the amount of hyperfaces s^{(shared)}has to be updated based on the refinement and coarsening markers M_{R}and M_{C}stored on the adaptivity marker communication stack, see Section 5.2.6. - 4.
- After traversal of cells in C, the information to reconstruct the communication information
is then given with

Clusters with leaf elements stored on level 1 or 2 require to be handled appropriately in case that
subclusters A and B do not exist. E.g. with a domain only consisting of two cells, we can directly set
s_{right}^{(2)} := 1, s_{left}^{(2)} := 1, s^{(shared)} := 1.

With the derived information, we can apply the split operation: The cell data on the stack can be
directly split based on the split information c^{({1,2})}. The traversal meta information (providing e.g. the
starting point of the grid traversal) has to be updated to account for the new traversal start. The
communication meta information of both children can also be determined, based on the information
on the number of shared edges of both children given in s_{{left,right}}^{({1,2})}, s^{(shared)} and c^{({1,2})}. For the
odd grid grammar, the inference of the required information is similar and therefore not further
discussed here.

The presented algorithm only accounts for updating the local cluster information and would lead to inconsistencies to the meta information stored at adjacent cluster. Updating this meta information on adjacent cluster is presented in Section 5.5.3.

The simulation data on the stacks is currently split after the traversal of the entire cluster. However, copying chunks of stack data from the other clusters can result in stream-like copy operations and result in a bandwidth-limited problem. A direct streaming of the persistent simulation data during the last adaptivity traversal to the stacks of the corresponding new child clusters has the potential to avoid this additionally required stream-copy operation but is not implemented, yet.

Using clustering based on tree splits, our cluster-based approach is restricted to joining two clusters only if they share the same parent node in the cluster tree. Joining two clusters is accomplished by

- (a)
- concatenating the cell data storage,
- (b)
- joining both traversal meta information and
- (c)
- joining both communication meta information and removing the RLE information about the edges shared by both child cluster.

So far, we can split and join clusters and update the meta information based on adaptivity markers written to the edge communication stacks. However, we also have to consider possible split and joins of adjacent clusters, e.g. if an RLE element has to be split to account for the associated cluster being split. This requires updating the edge communication meta information. We present an algorithm by modifying this edge communication information to synchronize with the adjacent meta communication information in case of splits or joins. This algorithm is based on local cluster and directly adjacent cluster access only and does not involve any global communication operations.

We continue refering to the cluster for which the RLE communication meta information is updated as the local cluster and a cluster described by a single RLE communication data stored for a local cluster as an adjacent cluster.

Transfer State | Description |

NO_TRANSFER | No split or join was executed on this cluster. |

SPLIT_PARENT | This node was split and both children have the state SPLIT_CHILD. |

SPLIT_CHILD | Cluster generated by a split operation are set to this state. |

JOINED_PARENT | This node was created by joining both children which have state JOINED_CHILD. |

JOINED_CHILD | This cluster represents a former child node which was joined with its sibling. This cluster is not deleted to allow accessing its communication meta information. |

Updating edge-based meta communication is based on flags stored per cluster which describe the split- and join-states of the cluster. All required states for markers of adjacent clusters are given in Table 5.1 and an overview of all relevant cases which have to be considered in Fig. 5.14.

Similar to the exchange of edge communication data, updating the communication information is accomplished by read-only access of adjacent cluster data for shared memory access. Examples for the split-split constellations are given in Fig. 5.15.

The algorithm then consists of the following steps for each of the left and right RLE meta information lists:

Algorithm: Outer loop of RLE consistency

- 1.
- An iterator is set to the first RLE entry in the meta information list.
- 2.
- For each iteration on the next RLE entry, we test for adjacent cluster changes.
- 3.
- In case that neither a split nor a join operation was detected in the adjacent cluster given by the RLE entry, we advance the iterator and continue with (3).
- 4.
- The RLE entry requires additional processing due to possible inconsistencies created by
split/join operations of the adjacent cluster. Since we are updating the communication
information and modifying data in parallel, we have to avoid race conditions. These
race conditions can occur since a processing of other clusters in parallel can lead to
accessing the meta information of this cluster.
If this is the first time of a modification of one entry in the local RLE list during the list iteration, we duplicate the list of local RLE communication meta information and continue iterating on the duplicate. The iterator is set to the same entry in the duplicated data and further operations are only executed on the duplicated data.

- 5.
- We continue by distinguishing different split or join cases. First, a check of the adjacent transfer state is done which can be NO_TRANSFER, SPLIT_PARENT or JOINED_CHILD, respectively, abbreviated with (NOP,SPLIT,JOIN). The second decision is the type of the local transfer state which can be NO_TRANSFER, SPLIT_CHILD or JOINED_PARENT, also respectively abbreviated with (NOP,SPLIT,JOIN). An overview of possible cases is given in Fig. 5.14.
- 6.
- For split transfer states of the adjacent partition, an additional counter remainingCommunicationElements is introduced and set to the number of edges of the currently processed local RLE entry.
- 7.
- Update the local RLE entry with one of the algorithmic blocks below, advance the iterator and continue at 3.

______________________________________

We distinguish the method for updating the RLE meta information depending on the transfer state of the adjacent cluster to decide in which access order to traverse the RLE-related adjacent clusters:

Algorithm: Local RLE entry update - adjacent transfer state: SPLIT_PARENT

- Local transfer state: SPLIT_CHILD:

For this transfer state constellation, we further distinguish between the following cases:- 1.
- The split operations for both adjacent clusters lead to RLE entries with equal cardinality, see e.g. the right image in Fig. 5.15. Then only the information on the new placement (pointer/rank) of the adjacent cluster (this information changed due to the parent’s split) has to be updated.
- 2.
- The RLE meta information is on a split of both adjacent clusters with the quantities not matching, see e.g. left image in Fig. 5.15.

In any case, we have to figure out the traversal order of the corresponding adjacent child cluster nodes.

Whether the first or second adjacent child node is traversed first, plays an important role to append additionally required RLE information in the correct order. This information is given by D with 0 for a traversal in the order “first, then second child” and 1 for “second, then first child”. We initialize our flag with D := 0. The traversal direction is then updated based on the local and adjacent SFC traversal direction on the shared cluster boundary:

- Rule 1) Traversal order of adjacently split children: If the cluster boundary traversal directions of the local and the adjacent parent’s cluster are equal, the traversal order of the adjacent children has to be in the same direction by setting D := 0, otherwise reversed by setting D := 1.
- Rule 2) Traversal order of locally split children: The traversal order is reversed if the RLE of the second local child is updated. This leads to a search of matching RLE entries from the end of the adjacent parent’s meta information.

An example for applying these rules is given below. Once the order of adjacent child access directions is known, the update process can start:

- 1.
- remainingCommunicationElements is initialized with the number of local RLE elements for which the update is done.

If the local cluster is a parent cluster which was created by joining two child clusters, the first adjacent cluster is searched for corresponding RLE elements. In case of a perfect match of the RLE number of shared edges, no update is required. Otherwise, a new RLE entry with the changed quantity of shared edges has to be inserted and the search is continued on the next adjacent child cluster for the remaining matching RLE element.

In case that the local cluster was not modified, rules similar to the SPLIT_CHILD case can be applied. Because of this similarity, these are not further discussed here.

______________________________________

Given the rules introduced in the previous algorithm, we discuss the relevant scenarios given in left image in Fig. 5.15 for cluster b by traversing the adjacent clusters.

Left image:

- Rule 1) The traversal on the cluster boundary of the shared edges of cluster b is counterclockwise and also the direction of e and f. This leads to a reversed child traversal order (D := 1).
- Rule 2) Since cluster b is the first child’s triangle, the adjacent traversal order is not changed.
- With D == 1, this leads to a reversal of the traversal of the adjacent child clusters, a second-first order. This means, that first cluster f is searched for adjacent edge parts, followed by cluster e. Each traversal leads to an insertion of corresponding new RLE entries to the RLE meta information. These entries replace the RLE entry associated to the split parent cluster.

Right image:

- Rule 1) The SFC traversal direction on the cluster boundary of shared edge of cluster b is counter-clockwise and thus not equal to the cluster boundary direction of c which is clockwise, leading to a non-reversed adjacent cluster tree traversal (D := 0).
- Rule 2) Since the triangle b is the first child triangle, the adjacent child traversal order is not changed (D := 0).
- Combining both rules, the child traversal of the adjacent node is done in first-second order. Thus cluster d is searched for adjacent edge parts first, followed by cluster c.

Algorithm: Local RLE entry update - adjacent transfer state: JOINED_CHILD

In case of a joined adjacent child, the local RLE is tested for the possibility to join it with the next local RLE. Such a join is producing a consistent state in case that both consecutively stored adjacently joined children have the same parent. Otherwise, this would result in a violation of our assumption of a unique RLE existing for shared interfaces, see Sec. 5.2.3. Other updates of the RLE meta information can be applied similarly to the split case above.

Algorithm: Local RLE entry update - adjacent transfer state: NO_TRANSFER

With all local RLE entries being updated directly in case of split and join operations, no modifications of the RLE entry are required in case of an adjacent cluster without split/join.

Algorithm: Postprocessing RLE updates

Finally, the possibly RLE meta information which was duplicated to avoiding race conditions (see step 4 in the ’outer loop’ part of the algorithm above) is used as the new RLE meta information.

This algorithm computes a consistent meta information fulfilling all theorems described so far in this thesis.

We like to give a short outlook to distributed-memory parallelization: Here, we use two-sided communication and the read-only push/pull principle: We separate the updates of the meta information into local and adjacent operations. Instead of looking up the required split/join information by accessing the adjacent clusters stored at another MPI node, our updates of the meta information for each local RLE entry rely on a cooperative communication:

All local clusters send the necessary split/join information about their local changes to the adjacent clusters (push). The adjacent clusters can then receive (pull) this data and update the meta communication information similar to the shared memory synchronization. This leads to a straightforward extension of the shared-memory algorithm presented above.

We have not considered the vertex communication meta information for dynamic clustering, yet. After cluster splits and joins, the vertex communication meta information can be in an inconsistent state since zero-length encoded vertex RLE entries can be missing. However, we can reconstruct the vertex communication information based on a valid edge communication from the previous section.

The algorithmic idea is based on the assumption that all clusters sharing the vertex also store one or two edge-related RLE meta information about edge-adjacent clusters (or a boundary) which also share the vertex. By accessing the edge-adjacent clusters, we can continue our search to the other vertex-sharing clusters. Then all clusters sharing a vertex can be traversed via following the RLE edge communication information associated to this vertex. This allows generation of a trace of all clusters sharing the vertex and, based on this trace, a reconstruction of our zero-length encoded vertex RLE meta information.

Without loss of generality, we only show the algorithm with the assumption that both SFC traversal directions along the hyperfaces shared by two clusters are in opposite directions.

For each cluster P_{i}, let the left and right RLE edge communication meta information be given by
R_{k}^{({left,right})} = (Q,m)_{k} with Q the adjacent cluster, m > 0 the number of edges shared with the
adjacent partition and 1 < k < N_{({left,right})} selecting the RLE entry. The cluster associated to the
first entry in the tuple R_{j} = (Q,m) is given by P(R_{j}) := R_{j}[1] == Q.

We use a flag F ∈{front,back} describing whether the vertex for which the meta information is determined
is associated to the front or back of the currently processed edge-related RLE entry. For the sake of convenience,
we formally^{2}
combine R^{({left,right})} to a single ring-like buffer of RLE meta information

Algorithm: Reconstruction of vertex meta information for cluster P_{i}

For all R_{j} ∈ R of the cluster P_{i}, execute the following operations:

- Initialize vertex tracing: We use flag F which is initialized to back since the vertex is
associated to the last edge processed by the RLE edge meta information. The previously
visited cluster is kept in P
_{(prev)}and initialized to P_{i}as the cluster to reconstruct the vertex information for. The next processed cluster then becomes P_{(current)}:= P(R_{j}).In case that P

_{(current)}represents a boundary, we set the flag F to front and continue the search with P_{(current)}:= P(R_{j+1}). If this is also a boundary, the vertex is not shared by any other cluster and we stop the algorithm. - Trace generation:
- 1.
- For the currently processed cluster P
_{(current)}, the communication meta information R_{(conn)}connecting the currently visited cluster to the previous one is determined by conn := {i|P(R_{i}) = P_{(prev)}} - 2.
- Using flag F, we load the next (F = front: R
_{(next)}:= R_{(conn+1)}) or previous (F = back: R_{(next)}:= R_{(conn-1)}) RLE communication meta information to R_{(next)}. The next cluster to visit is then given by P(R_{(next)}).- (a)
- If the adjacent cluster is of type boundary (P(R
_{(next)}) == -1), we consider two cases:(I) In case that this is the second time that we run into a boundary, we visited all adjacent clusters which are connected via edges and the algorithm terminates.

(II) Otherwise, we rerun our trace generation (1) on the start cluster P

_{(prev)}:= P_{i}but traversing in the other direction P_{(current)}:= P(R_{j+1}). The flag F is then set to front. - (b)
- In case of visiting the original cluster P
_{(current)}== P_{i}, we found all clusters connected to the vertex and finish the trace generation. - (c)
- The next cluster is visited by the following steps: First, we set the previously
visited cluster to the current one P
_{(prev)}:= P_{(current)}and set the current cluster to the next cluster to visit P_{(current)}:= P(R_{j+1}). Then we continue with our trace generation (1).

______________________________________

Based on the visited clusters, we can reconstruct the required RLE vertex meta information. For each RLE, at most 8 adjacent clusters which share the vertex are visited. Therefore, this algorithm has a constant runtime for each RLE. Furthermore, the algorithm is based on local (only vertex-adjacent) access operations only. Giving the average number of RLE elements per cluster in avgRLE, the runtime complexity is O(#Cluster ⋅ avgRLE).

We give a concrete example of this algorithm for reconstructing the RLE information for cluster C and
the middle vertex at the top in Fig. 5.17. Here, we start with the RLE entry R_{3} := (A,.) and initialize
the algorithm to P_{(prev)} := C, P_{(current)} := A and F := back.

- Visiting cluster A, we first determine the RLE associated to the previous cluster to
be R
_{3}= (C,.). Since the F is set to back, we load the previous RLE R_{2}, yielding R_{(next)}:= R_{2}= (D,.). - We update the previous and current cluster to process to P
_{(prev)}:= P_{(current)}= A and P_{(current)}:= P(R_{(next)}) = D - Visiting cluster D, we determine the adjacent RLE to be R
_{1}= (A,.). Since F is set to back, we load the previous RLE R_{0}= R_{3}, yielding R_{(next)}:= R_{3}= (-1,-1).Due to the detected boundary, we set F := front, P

_{(prev)}:= C and P_{(current)}:= R_{j+1}= B. - Processing cluster B, the RLE entry associated to the cluster C is given by R
_{1}= (C,.). F is set to front, thus we load the next RLE R_{2}, yielding R_{(next)}:= R_{2}= (-1,-1). Since this is the second time that we run into a boundary, the algorithm terminates.

We also like to mention an alternative way which was not implemented in this work: The information on the adjacently generated vertices can also be inferred similar to the reconstruction of edge meta information, see Sec. 5.5.3. With this method, we can also search on the adjacent split/joined cluster meta information for a possible requirement of inserting or removing vertex RLE meta information.