These html pages are based on the PhD thesis "Cluster-Based Parallelization of Simulations on Dynamically Adaptive Grids and Dynamic Resource Management" by Martin Schreiber.

There is also more information and a PDF version available.

We still have to decide when to split and join clusters, as well as which compute units execute operations on which clusters.

We start with the annotations used for cluster generation and scheduling decisions. A straightforward parallelization creates as many clusters as there are compute units, resulting in a 1:1 scheduling; an example is given in Section 5.1.3. With our dynamic clustering based on tree splits, this would clearly lead to high idle times due to severe load imbalances, e.g. created by refined grid cells in a single cluster, see Fig. 5.18.

In this work, we developed two different split approaches: massive splitting and load-balancing-oriented splits. Both methods are based on annotating each cluster i with its workload W_{i}, e.g. the number of cells in the cluster. The entire simulation workload is then given by W := ∑_{i} W_{i}. For the load-balancing-oriented splits, we further label each cluster with range-related information R_{i} := R_{i-1} + W_{i-1} and set R_{1} := 0. These labels are computed with a parallel traversal of the cluster tree (see Section 5.3.3), with subindices lhs, rhs and parent denoting the left child, the right child and the parent tree node, respectively:

- We run through the cluster tree bottom-up and annotate each inner cluster tree node with W := W_{lhs} + W_{rhs}.
- During a second top-down traversal, we start at the root node and annotate it with R_{root} := 0. We then annotate each left child node with R_{lhs} := R_{parent} and each right child node with R_{rhs} := R_{parent} + W_{lhs}.

Then, W_{j} represents the workload of all child clusters for a particular cluster tree node j and R_{j} the
range [R_{j},R_{j} + W_{j}) of workloads, see Fig. 5.19. We continue using these annotations for scheduling
and generation decisions.
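The two annotation traversals above can be sketched as follows; the node layout and function names are illustrative rather than taken from the framework's implementation:

```cpp
#include <cassert>
#include <memory>

// Cluster-tree node with the two annotations described in the text.
// Cluster trees created by binary splits are full binary trees, so an
// inner node is assumed to always have both children.
struct Node {
    long W = 0;   // workload annotation, e.g. number of cells
    long R = 0;   // range-start annotation
    std::unique_ptr<Node> lhs, rhs;
    bool isLeaf() const { return !lhs && !rhs; }
};

// Bottom-up traversal: annotate each inner node with W := W_lhs + W_rhs.
long annotateWorkload(Node& n) {
    if (!n.isLeaf())
        n.W = annotateWorkload(*n.lhs) + annotateWorkload(*n.rhs);
    return n.W;
}

// Top-down traversal: R_root := 0, R_lhs := R_parent,
// R_rhs := R_parent + W_lhs.
void annotateRanges(Node& n, long R = 0) {
    n.R = R;
    if (!n.isLeaf()) {
        annotateRanges(*n.lhs, R);
        annotateRanges(*n.rhs, R + n.lhs->W);
    }
}
```

Running both traversals annotates every node such that the leaf ranges [R_{i}, R_{i} + W_{i}) partition the total workload range [0, W).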

We consider different scheduling strategies to assign the tasks on clusters to compute units. In this context, tasks are any kind of operation to be executed on a cluster: a time step, a particular adaptivity traversal, backend, setup, etc.

With p compute units, let W_{avg} := W/p be the average workload for each compute unit. Each cluster i is then assigned to compute unit

⌊(R_{i} + W_{i}/2) / W_{avg}⌋. (5.1)

To avoid recomputing this association for each traversal of the cluster tree, we cache the range of compute-unit ids fulfilling Eq. (5.1) and store it at each leaf node during the dynamic cluster-generation traversals. This yields a unique ownership of each cluster. Furthermore, we also reduce this information bottom-up and extend the annotation of each inner node in the cluster tree by the range of compute units which own at least one leaf node below the currently processed node.
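A minimal sketch of the per-cluster ownership computation, assuming the midpoint rule ⌊(R_{i} + W_{i}/2) / W_{avg}⌋ as the realization of the unique assignment; the function name and the clamping of the boundary case are assumptions of this sketch:

```cpp
#include <cassert>

// Owner of cluster i, given its range start R_i and workload W_i.
// The midpoint rule below is one way to realize a unique assignment of
// each cluster to a compute unit and is an assumption of this sketch,
// not necessarily the exact thesis formula.
// W_total: entire simulation workload, p: number of compute units.
int ownerOf(double R_i, double W_i, double W_total, int p) {
    double W_avg = W_total / p;                           // average workload per unit
    int u = static_cast<int>((R_i + 0.5 * W_i) / W_avg);  // midpoint rule
    return u < p ? u : p - 1;                             // clamp the boundary case
}
```

In the framework, this value is computed once during cluster generation and cached at the leaf node rather than re-evaluated per traversal.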

We implement two different scheduling strategies by either traversing only tree nodes for which Equation (5.1) holds or by setting task affinities for tasks created on each node. These two scheduling strategies are denoted as owner-compute and affinities; both can be used to exploit the spatial locality of each cluster on a NUMA memory hierarchy.

Using the task-flooding scheduling, we create a task for each cluster tree node. This leads to operations (tasks) being executed via the work-stealing mechanisms of the underlying shared-memory parallelization models.
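A minimal sketch of such task-flooding, with std::async standing in for the OpenMP/TBB tasking so the example stays self-contained; all names are illustrative, and the cluster tree is assumed to be a full binary tree, as produced by binary splits:

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <future>
#include <memory>

// Minimal cluster-tree node; inner nodes are assumed to have both children.
struct CNode {
    std::unique_ptr<CNode> lhs, rhs;
};

// Placeholder for the per-cluster operation (time step, adaptivity, ...).
void processCluster(std::atomic<int>& counter) { ++counter; }

// Emit one task per cluster-tree node: one child subtree is processed by a
// new task, the other by the current one; leaves run the cluster operation.
void floodTasks(const CNode& n, std::atomic<int>& counter) {
    if (!n.lhs && !n.rhs) { processCluster(counter); return; }
    auto f = std::async(std::launch::async, floodTasks,
                        std::cref(*n.lhs), std::ref(counter));
    floodTasks(*n.rhs, counter);
    f.wait();
}
```

With OpenMP or TBB, the spawned tasks would be picked up by idle worker threads via work stealing instead of dedicated std::async threads.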

We present two different strategies for the cluster generation in this section.

An owner-compute scheme belongs to the classical domain-decomposition approaches since it directly aims at assigning computations to compute units in a static way. For perfect load balancing, equal workloads should be assigned to each compute unit. When assigning clusters to compute units, the load balancing can be improved by splitting clusters whose cells can be assigned to more than one compute unit. Given p compute units, the average workload per compute unit is given by W_{avg} := W/p. This leads to splitting a cluster i if its workload range [R_{i}, R_{i} + W_{i}) overlaps more than one compute-unit interval [c · W_{avg}, (c+1) · W_{avg}).

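The load-balancing-oriented split test, deciding whether a cluster's workload range overlaps the intervals of more than one compute unit, can be sketched as follows; the function name is illustrative, and integer workloads with a total workload divisible by p are assumed for simplicity:

```cpp
#include <cassert>

// Split cluster i if its half-open workload range [R_i, R_i + W_i)
// touches more than one compute-unit interval of width W_avg.
// Assumes integer workloads and W_total divisible by p (sketch only).
bool shouldSplit(long R_i, long W_i, long W_total, int p) {
    long W_avg = W_total / p;
    long first = R_i / W_avg;               // unit of the first cell
    long last  = (R_i + W_i - 1) / W_avg;   // unit of the last cell
    return first != last;
}
```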
An alternative parallelization method is given by massive splitting. We rely on the assumption that the overhead for starting computations on a cluster is relatively small compared to the computational workload in each cluster. This allows us to create more clusters than there are compute units available and to process them efficiently due to the small overhead. Contrary to a 1:1 scheduling of tree splits, which would yield high idle times due to work imbalances, we expect decreased idle times by creating far more clusters than there are compute units available (see Fig. 5.20).

For massive splitting, we split each cluster if its workload exceeds a particular threshold T_{s} and join two clusters if the workloads of both child clusters undershoot T_{j}. The join criterion can therefore be evaluated cluster-locally, and only the boolean agreement of both child clusters to the join operation has to be compared.

In this work, we use a join threshold T_{j} := T_{s}/2, with T_{s} representing the split threshold. Selecting the join threshold in this way assures that a join of two clusters does not directly result in a split: the joined cluster's workload is less than 2 · T_{j} = T_{s} and therefore cannot exceed the split threshold.
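The threshold-based decisions of massive splitting can be sketched as follows; the encoding as two predicates and the concrete threshold value in the test are illustrative assumptions:

```cpp
#include <cassert>

// Massive-splitting thresholds: split when the workload exceeds T_s;
// join when both sibling workloads undershoot T_j := T_s / 2, so the
// joined workload stays below T_s and is not immediately split again.
struct Thresholds {
    long T_s;                                // split threshold
    long T_j() const { return T_s / 2; }     // derived join threshold
};

bool shouldSplitCluster(long W_i, const Thresholds& t) {
    return W_i > t.T_s;
}

// A join is executed only if BOTH child clusters agree.
bool shouldJoin(long W_lhs, long W_rhs, const Thresholds& t) {
    return W_lhs < t.T_j() && W_rhs < t.T_j();
}
```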

With the growing number of cores on shared-memory systems, a wide variety of threading libraries is available. Considering dynamically changing resources (see Part IV), a threading model also has to support such extensions.

To introduce the parallel execution of operations on each cluster (see Section 5.6), we consider two different threading tools which are representative for other thread-based parallelization models: OpenMP and Threading Building Blocks (TBB).
Both threading libraries offer parallelism in different ways. With OpenMP, language extensions via pragma precompiler directives are used to express parallelization in the program code, whereas the parallelization features of TBB are made available to the application developer via a C++ interface, with additional convenience offered by C++11 language features. Since information on both parallelization methods is available in other work, we refer the interested reader to the corresponding websites and reference guides for an introduction and only highlight how these parallelization models are used in our development.

Both threading extensions provide support for the very basic parallelization features required for our cluster-based framework: parallel processing of a for-loop as well as tasking. We implemented three different ways for a parallel execution of operations on clusters.

- (a) An owner-compute implementation starts a task for each compute unit via the parallel execution of a for-loop over the range of all compute units. Each for-loop iteration and, thus, each compute unit executes operations only for the subset of clusters for which Eq. (5.1), page 277, holds.
- (b) A task is created for each cluster-tree node. In the case of massive splitting, this leads to a flooding of the overall system with tasks and uses the work stealing of both considered threading libraries. During the traversal of our cluster tree, the implementation emits a new task for each cluster tree node.
- (c) We can further use features offered by TBB to create tasks with affinities to compute units. Such an attribute leads to tasks being enqueued into the working queue of the thread for which the affinity was set.
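Variant (a) can be sketched as follows, with std::thread standing in for the OpenMP/TBB parallel for-loop so the example stays self-contained; the owner array corresponds to the ownership information cached during cluster generation, and all names are illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Owner-compute execution: one thread per compute unit, each processing
// only the clusters it owns. The ownership test of Eq. (5.1) is assumed
// to be precomputed into owner[i] for each cluster i.
void ownerComputeRun(const std::vector<int>& owner,   // owner[i] = unit of cluster i
                     std::vector<int>& processedBy,   // records the executing unit
                     int p) {
    std::vector<std::thread> units;
    for (int u = 0; u < p; ++u)
        units.emplace_back([&, u] {
            for (std::size_t i = 0; i < owner.size(); ++i)
                if (owner[i] == u)         // cached ownership test
                    processedBy[i] = u;    // placeholder for the cluster operation
        });
    for (auto& t : units) t.join();
}
```

Since each cluster has exactly one owner, the threads write to disjoint entries of processedBy and no synchronization is required inside the loop.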