Dynamic cluster generation

These html pages are based on the PhD thesis "Cluster-Based Parallelization of Simulations on Dynamically Adaptive Grids and Dynamic Resource Management" by Martin Schreiber.
There is also more information and a PDF version available.

[next] [prev] [prev-tail] [tail] [up]

5.5 Dynamic cluster generation

For our clusters based on tree splits, we use tree-split and -join operations for the cluster generation.

5.5.1 Splitting

To split a cluster, we require information on (a) the number of shared hyperfaces along the separating hyperfaces and (b) the association of persistent data to both child clusters.

We achieve inferring this information by stopping our grid traversal at particular positions and implicitly derive the required quantities on the number of hyperfaces and cells based on the communication stacks (meta communication information) and persistent data stack (simulation data). We refer to this information as split information.

For optimization reasons, we can further join the determination of split information with the last backward adaptivity traversal and present the algorithm based on this backward traversal which is in reversed direction, see Fig. 5.13. The algorithm is based on the spacetree nodes on the 2^nd level relative to the cluster root tree node and follows the SFC in reversed direction with (a,b,c,d). Here, we consider partitions (A,B,C) respectively set up by leaf nodes of subtrees (a,b,c ∪ d).

pict

Figure 5.13: Backward traversal generating split information. The stacks generation direction are given in forward traversal direction while the traversal itself is done in reversed direction during the last backward adaptivity traversal [SBB12].

The application of our algorithm is given in Fig. 5.13 with the left and right communication stacks terminology used for the last adaptivity (backward) traversal and for grammars of type even. We denote the stacks for the left and right adaptivity communication stack with S_left and S_right, respectively. The cell stack is given by C, and we determine the number of cells of the parent element by considering the elements stored on the cell stack: c^(parent) := |C|.

The information on the split operation is stored in s_{left,right}⁽¹⁾ for the first and in s_{left,right}⁽²⁾ for the second child. We also require the number of edges shared by both split clusters in s^(shared). The number of cells associated to the first and second child partition is stored to c⁽¹⁾ and c⁽²⁾.

1.: After traversal of cells in subcluster A, the communication edges on the left stack for the 2^nd subcluster are memorized with $s(r2)ight := |Sleft|$ We like to emphasize again, that the left and right labels are reversed due to the information derived in backward traversal whereas RLE meta information is stored in forward-traversal direction.
2.: After traversal of cells in B, the shared edges are computed by $(shared) (2) s := |Sleft|- sright$ and the number of parent’s inter-cluster edges for the 2^nd subcluster on the right stack are saved in $(2) sleft := |Sright|.$ The processed number of cells which is then associated to the 2^nd cluster created by the split operation is inferred by $c(2) := c(parent) - |C|$ with |C| the remaining number of cells to be processed. This also represents the number of cells assigned to the first subpartition C: $(1) c := |C|.$
3.: Due to the adaptivity traversal, the information on the shared edges s^(shared) only represents the quantity of shared hyperfaces before refining or coarsening the grid based on the adaptivity states. Therefore, the amount of hyperfaces s^(shared) has to be updated based on the refinement and coarsening markers M_R and M_C stored on the adaptivity marker communication stack, see Section 5.2.6.
4.: After traversal of cells in C, the information to reconstruct the communication information is then given with $(1) (2) sright := |Sleft|- sright,$ and $(1) (2) sleft := |Sright| - sleft.$

Clusters with leaf elements stored on level 1 or 2 require to be handled appropriately in case that subclusters A and B do not exist. E.g. with a domain only consisting of two cells, we can directly set s_right⁽²⁾ := 1, s_left⁽²⁾ := 1, s^(shared) := 1.

With the derived information, we can apply the split operation: The cell data on the stack can be directly split based on the split information c^({1,2}). The traversal meta information (providing e.g. the starting point of the grid traversal) has to be updated to account for the new traversal start. The communication meta information of both children can also be determined, based on the information on the number of shared edges of both children given in s_{left,right}^({1,2}), s^(shared) and c^({1,2}). For the odd grid grammar, the inference of the required information is similar and therefore not further discussed here.

The presented algorithm only accounts for updating the local cluster information and would lead to inconsistencies to the meta information stored at adjacent cluster. Updating this meta information on adjacent cluster is presented in Section 5.5.3.

The simulation data on the stacks is currently split after the traversal of the entire cluster. However, copying chunks of stack data from the other clusters can result in stream-like copy operations and result in a bandwidth-limited problem. A direct streaming of the persistent simulation data during the last adaptivity traversal to the stacks of the corresponding new child clusters has the potential to avoid this additionally required stream-copy operation but is not implemented, yet.

5.5.2 Joining

Using clustering based on tree splits, our cluster-based approach is restricted to joining two clusters only if they share the same parent node in the cluster tree. Joining two clusters is accomplished by

(a): concatenating the cell data storage,
(b): joining both traversal meta information and
(c): joining both communication meta information and removing the RLE information about the edges shared by both child cluster.

5.5.3 Split and join updates of meta communication information

So far, we can split and join clusters and update the meta information based on adaptivity markers written to the edge communication stacks. However, we also have to consider possible split and joins of adjacent clusters, e.g. if an RLE element has to be split to account for the associated cluster being split. This requires updating the edge communication meta information. We present an algorithm by modifying this edge communication information to synchronize with the adjacent meta communication information in case of splits or joins. This algorithm is based on local cluster and directly adjacent cluster access only and does not involve any global communication operations.

We continue refering to the cluster for which the RLE communication meta information is updated as the local cluster and a cluster described by a single RLE communication data stored for a local cluster as an adjacent cluster.

Transfer State	Description

NO_TRANSFER	No split or join was executed on this cluster.

SPLIT_PARENT	This node was split and both children have the state SPLIT_CHILD.

SPLIT_CHILD	Cluster generated by a split operation are set to this state.

JOINED_PARENT	This node was created by joining both children which have state JOINED_CHILD.

JOINED_CHILD	This cluster represents a former child node which was joined with its sibling. This cluster is not deleted to allow accessing its communication meta information.

Table 5.1: Different state flags assigned to each node in the cluster tree.

pict

Figure 5.14: Possible split and join constellations for adjacent clusters. Each thick black line represents one RLE meta information after an update. The left thick line corresponds to the local meta information and the right one to the adjacent meta information. Different split and join constellations are possible: Split operations are marked with a red line, join operations with a green dashed line. See Fig. 5.15 for a concrete example.

Updating edge-based meta communication is based on flags stored per cluster which describe the split- and join-states of the cluster. All required states for markers of adjacent clusters are given in Table 5.1 and an overview of all relevant cases which have to be considered in Fig. 5.14.

Shared memory

pict

Figure 5.15: Two different split-split cluster states with the SFC traversal drawn closely to the shared interface either in opposite or identical direction. The blue arrow shows the traversal direction of the SFC curve close to the considered edge.

Similar to the exchange of edge communication data, updating the communication information is accomplished by read-only access of adjacent cluster data for shared memory access. Examples for the split-split constellations are given in Fig. 5.15.

The algorithm then consists of the following steps for each of the left and right RLE meta information lists:

Algorithm: Outer loop of RLE consistency

1.: An iterator is set to the first RLE entry in the meta information list.
2.: For each iteration on the next RLE entry, we test for adjacent cluster changes.
3.: In case that neither a split nor a join operation was detected in the adjacent cluster given by the RLE entry, we advance the iterator and continue with (3).
4.: The RLE entry requires additional processing due to possible inconsistencies created by split/join operations of the adjacent cluster. Since we are updating the communication information and modifying data in parallel, we have to avoid race conditions. These race conditions can occur since a processing of other clusters in parallel can lead to accessing the meta information of this cluster.
If this is the first time of a modification of one entry in the local RLE list during the list iteration, we duplicate the list of local RLE communication meta information and continue iterating on the duplicate. The iterator is set to the same entry in the duplicated data and further operations are only executed on the duplicated data.
5.: We continue by distinguishing different split or join cases. First, a check of the adjacent transfer state is done which can be NO_TRANSFER, SPLIT_PARENT or JOINED_CHILD, respectively, abbreviated with (NOP,SPLIT,JOIN). The second decision is the type of the local transfer state which can be NO_TRANSFER, SPLIT_CHILD or JOINED_PARENT, also respectively abbreviated with (NOP,SPLIT,JOIN). An overview of possible cases is given in Fig. 5.14.
6.: For split transfer states of the adjacent partition, an additional counter remainingCommunicationElements is introduced and set to the number of edges of the currently processed local RLE entry.
7.: Update the local RLE entry with one of the algorithmic blocks below, advance the iterator and continue at 3.

______________________________________

We distinguish the method for updating the RLE meta information depending on the transfer state of the adjacent cluster to decide in which access order to traverse the RLE-related adjacent clusters:

Algorithm: Local RLE entry update - adjacent transfer state: SPLIT_PARENT

Local transfer state: SPLIT_CHILD:
For this transfer state constellation, we further distinguish between the following cases:
1.
The split operations for both adjacent clusters lead to RLE entries with equal cardinality, see e.g. the right image in Fig. 5.15. Then only the information on the new placement (pointer/rank) of the adjacent cluster (this information changed due to the parent’s split) has to be updated.
2.
The RLE meta information is on a split of both adjacent clusters with the quantities not matching, see e.g. left image in Fig. 5.15.

In any case, we have to figure out the traversal order of the corresponding adjacent child cluster nodes.
Whether the first or second adjacent child node is traversed first, plays an important role to append additionally required RLE information in the correct order. This information is given by D with 0 for a traversal in the order “first, then second child” and 1 for “second, then first child”. We initialize our flag with D := 0. The traversal direction is then updated based on the local and adjacent SFC traversal direction on the shared cluster boundary:
- Rule 1) Traversal order of adjacently split children: If the cluster boundary traversal directions of the local and the adjacent parent’s cluster are equal, the traversal order of the adjacent children has to be in the same direction by setting D := 0, otherwise reversed by setting D := 1.
- Rule 2) Traversal order of locally split children: The traversal order is reversed if the RLE of the second local child is updated. This leads to a search of matching RLE entries from the end of the adjacent parent’s meta information.
An example for applying these rules is given below. Once the order of adjacent child access directions is known, the update process can start:
1.
remainingCommunicationElements is initialized with the number of local RLE elements for which the update is done.

An adjacent child is accessed according to the adjacent traversal direction D, and the edge communication elements are updated in case of matching RLE elements including decreasing the variable remainingCommunicationElements with the number of edge communication elements in the adjacent cluster.

If the variable remainingCommunicationElements is less or equal to zero, no further updates are required.

Otherwise, the next child is searched for a matching RLE entry with meta information of exactly remainingCommunicationElements elements.

Local transfer state: JOINED_PARENT
If the local cluster is a parent cluster which was created by joining two child clusters, the first adjacent cluster is searched for corresponding RLE elements. In case of a perfect match of the RLE number of shared edges, no update is required. Otherwise, a new RLE entry with the changed quantity of shared edges has to be inserted and the search is continued on the next adjacent child cluster for the remaining matching RLE element.

Local transfer state: NO_TRANSFER
In case that the local cluster was not modified, rules similar to the SPLIT_CHILD case can be applied. Because of this similarity, these are not further discussed here.

______________________________________

Given the rules introduced in the previous algorithm, we discuss the relevant scenarios given in left image in Fig. 5.15 for cluster b by traversing the adjacent clusters.

Left image:

Rule 1) The traversal on the cluster boundary of the shared edges of cluster b is counterclockwise and also the direction of e and f. This leads to a reversed child traversal order (D := 1).
Rule 2) Since cluster b is the first child’s triangle, the adjacent traversal order is not changed.
With D == 1, this leads to a reversal of the traversal of the adjacent child clusters, a second-first order. This means, that first cluster f is searched for adjacent edge parts, followed by cluster e. Each traversal leads to an insertion of corresponding new RLE entries to the RLE meta information. These entries replace the RLE entry associated to the split parent cluster.

Right image:

Rule 1) The SFC traversal direction on the cluster boundary of shared edge of cluster b is counter-clockwise and thus not equal to the cluster boundary direction of c which is clockwise, leading to a non-reversed adjacent cluster tree traversal (D := 0).
Rule 2) Since the triangle b is the first child triangle, the adjacent child traversal order is not changed (D := 0).
Combining both rules, the child traversal of the adjacent node is done in first-second order. Thus cluster d is searched for adjacent edge parts first, followed by cluster c.

Algorithm: Local RLE entry update - adjacent transfer state: JOINED_CHILD

In case of a joined adjacent child, the local RLE is tested for the possibility to join it with the next local RLE. Such a join is producing a consistent state in case that both consecutively stored adjacently joined children have the same parent. Otherwise, this would result in a violation of our assumption of a unique RLE existing for shared interfaces, see Sec. 5.2.3. Other updates of the RLE meta information can be applied similarly to the split case above.

Algorithm: Local RLE entry update - adjacent transfer state: NO_TRANSFER

With all local RLE entries being updated directly in case of split and join operations, no modifications of the RLE entry are required in case of an adjacent cluster without split/join.

Algorithm: Postprocessing RLE updates

Finally, the possibly RLE meta information which was duplicated to avoiding race conditions (see step 4 in the ’outer loop’ part of the algorithm above) is used as the new RLE meta information.

This algorithm computes a consistent meta information fulfilling all theorems described so far in this thesis.

Distributed memory

We like to give a short outlook to distributed-memory parallelization: Here, we use two-sided communication and the read-only push/pull principle: We separate the updates of the meta information into local and adjacent operations. Instead of looking up the required split/join information by accessing the adjacent clusters stored at another MPI node, our updates of the meta information for each local RLE entry rely on a cooperative communication:

All local clusters send the necessary split/join information about their local changes to the adjacent clusters (push). The adjacent clusters can then receive (pull) this data and update the meta communication information similar to the shared memory synchronization. This leads to a straightforward extension of the shared-memory algorithm presented above.

5.5.4 Reconstruction of vertex communication meta information

We have not considered the vertex communication meta information for dynamic clustering, yet. After cluster splits and joins, the vertex communication meta information can be in an inconsistent state since zero-length encoded vertex RLE entries can be missing. However, we can reconstruct the vertex communication information based on a valid edge communication from the previous section.

The algorithmic idea is based on the assumption that all clusters sharing the vertex also store one or two edge-related RLE meta information about edge-adjacent clusters (or a boundary) which also share the vertex. By accessing the edge-adjacent clusters, we can continue our search to the other vertex-sharing clusters. Then all clusters sharing a vertex can be traversed via following the RLE edge communication information associated to this vertex. This allows generation of a trace of all clusters sharing the vertex and, based on this trace, a reconstruction of our zero-length encoded vertex RLE meta information.

Detailed algorithm

Without loss of generality, we only show the algorithm with the assumption that both SFC traversal directions along the hyperfaces shared by two clusters are in opposite directions.

For each cluster P_i, let the left and right RLE edge communication meta information be given by R_k^{({left,right})} = (Q,m)_k with Q the adjacent cluster, m > 0 the number of edges shared with the adjacent partition and 1 < k < N_{({left,right})} selecting the RLE entry. The cluster associated to the first entry in the tuple R_j = (Q,m) is given by P(R_j) := R_j[1] == Q.

We use a flag F ∈{front,back} describing whether the vertex for which the meta information is determined is associated to the front or back of the currently processed edge-related RLE entry. For the sake of convenience, we formally² combine R^{({left,right})} to a single ring-like buffer of RLE meta information

( (right) (left) ) R := R i=1,2,...,N(right),Ri=N(left),N (left)- 1,...,1

with N_left and N_right the number of tuples in the left and right RLE meta information list, respectively. We further define a periodic padding with R₀ := R_|R| and R_|R|+1 := R₁. For domain boundaries, we further introduce a special RLE tuple (-1,-1) to account for the boundary edges.

pict

Figure 5.16: Sketch of reconstruction of vertex communication meta information. The meta information is reconstructed for the vertex shared by edge communication elements R_j and R_j+1 of cluster A.

Algorithm: Reconstruction of vertex meta information for cluster P_i

For all R_j ∈ R of the cluster P_i, execute the following operations:

Initialize vertex tracing: We use flag F which is initialized to back since the vertex is associated to the last edge processed by the RLE edge meta information. The previously visited cluster is kept in P_(prev) and initialized to P_i as the cluster to reconstruct the vertex information for. The next processed cluster then becomes P_(current) := P(R_j).
In case that P_(current) represents a boundary, we set the flag F to front and continue the search with P_(current) := P(R_j+1). If this is also a boundary, the vertex is not shared by any other cluster and we stop the algorithm.
Trace generation:
1.
For the currently processed cluster P_(current), the communication meta information R_(conn) connecting the currently visited cluster to the previous one is determined by conn := {i|P(R_i) = P_(prev)}
2.
Using flag F, we load the next (F = front: R_(next) := R_(conn+1)) or previous (F = back: R_(next) := R_(conn-1)) RLE communication meta information to R_(next). The next cluster to visit is then given by P(R_(next)).
(a)
If the adjacent cluster is of type boundary (P(R_(next)) == -1), we consider two cases:
(I) In case that this is the second time that we run into a boundary, we visited all adjacent clusters which are connected via edges and the algorithm terminates.
(II) Otherwise, we rerun our trace generation (1) on the start cluster P_(prev) := P_i but traversing in the other direction P_(current) := P(R_j+1). The flag F is then set to front.
(b)
In case of visiting the original cluster P_(current) == P_i, we found all clusters connected to the vertex and finish the trace generation.
(c)
The next cluster is visited by the following steps: First, we set the previously visited cluster to the current one P_(prev) := P_(current) and set the current cluster to the next cluster to visit P_(current) := P(R_j+1). Then we continue with our trace generation (1).

______________________________________

Based on the visited clusters, we can reconstruct the required RLE vertex meta information. For each RLE, at most 8 adjacent clusters which share the vertex are visited. Therefore, this algorithm has a constant runtime for each RLE. Furthermore, the algorithm is based on local (only vertex-adjacent) access operations only. Giving the average number of RLE elements per cluster in avgRLE, the runtime complexity is O(#Cluster ⋅ avgRLE).

pict

Figure 5.17: Generation of vertex communication data based on RLE edge communication for cluster C. Edge-based RLE information is given by blue arrows and the left and right information combined to R (see text for further description).

Example

We give a concrete example of this algorithm for reconstructing the RLE information for cluster C and the middle vertex at the top in Fig. 5.17. Here, we start with the RLE entry R₃ := (A,.) and initialize the algorithm to P_(prev) := C, P_(current) := A and F := back.

Visiting cluster A, we first determine the RLE associated to the previous cluster to be R₃ = (C,.). Since the F is set to back, we load the previous RLE R₂, yielding R_(next) := R₂ = (D,.).
We update the previous and current cluster to process to P_(prev) := P_(current) = A and P_(current) := P(R_(next)) = D
Visiting cluster D, we determine the adjacent RLE to be R₁ = (A,.). Since F is set to back, we load the previous RLE R₀ = R₃, yielding R_(next) := R₃ = (-1,-1).
Due to the detected boundary, we set F := front, P_(prev) := C and P_(current) := R_j+1 = B.
Processing cluster B, the RLE entry associated to the cluster C is given by R₁ = (C,.). F is set to front, thus we load the next RLE R₂, yielding R_(next) := R₂ = (-1,-1). Since this is the second time that we run into a boundary, the algorithm terminates.

We also like to mention an alternative way which was not implemented in this work: The information on the adjacently generated vertices can also be inferred similar to the reconstruction of edge meta information, see Sec. 5.5.3. With this method, we can also search on the adjacent split/joined cluster meta information for a possible requirement of inserting or removing vertex RLE meta information.

[next] [prev] [prev-tail] [front] [up]