
5.6 Shared-memory parallelization


Figure 5.18: Parallelization by splitting into as many clusters as there are processors. This leads to high idle times for adaptive grids [SBB12].

We still have to choose when to split and join clusters, as well as which compute units shall execute operations on which clusters.

We start with the annotations used for cluster-generation and scheduling decisions. A parallelization that creates as many clusters as there are compute units results in a 1:1 scheduling; an example is given in Section 5.1.3. With our dynamic clustering based on tree splits, this would clearly lead to high idle times due to severe load imbalances, e.g. caused by refined grid cells in a single cluster, see Fig. 5.18.

In this work, we developed two different split approaches: massive splitting and load-balancing-oriented splits. Both methods are based on annotating each cluster $i$ with its workload $W_i$, e.g. the number of cells in the cluster. The entire simulation workload is then given by $W := \sum_i W_i$. For the load-balancing-oriented splits, we additionally label each cluster with range information $R_i := R_{i-1} + W_{i-1}$, setting $R_1 := 0$. These labels are computed with a parallel traversal of the cluster tree (see Section 5.3.3), with subindices lhs, rhs and parent denoting the left child, the right child and the parent tree node, respectively:
\[
W_{\mathrm{parent}} := W_{\mathrm{lhs}} + W_{\mathrm{rhs}}, \qquad
R_{\mathrm{lhs}} := R_{\mathrm{parent}}, \qquad
R_{\mathrm{rhs}} := R_{\mathrm{parent}} + W_{\mathrm{lhs}}.
\]


Figure 5.19: Annotation of a cluster tree with weights $W_j$ and ranges $R_j$ from left to right [SWB13a].

Then, $W_j$ represents the workload of all child clusters of a particular cluster tree node $j$, and $R_j$ the workload range $[R_j, R_j + W_j)$, see Fig. 5.19. We continue using these annotations for scheduling and cluster-generation decisions.
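
The following C++ sketch illustrates one way to compute these annotations over a binary cluster tree. The node type, fields and function names (ClusterTreeNode, reduceWorkload, annotateRanges) are hypothetical; the thesis framework derives the labels during its parallel cluster-tree traversals (Section 5.3.3), while the sketch uses plain recursion for brevity.

    #include <cstddef>

    // Hypothetical node type for the binary cluster tree.
    struct ClusterTreeNode {
        ClusterTreeNode *lhs = nullptr;   // left child
        ClusterTreeNode *rhs = nullptr;   // right child
        std::size_t workload = 0;         // W_j: cells in all clusters below this node
        std::size_t range_start = 0;      // R_j: start of the range [R_j, R_j + W_j)

        bool isLeaf() const { return lhs == nullptr && rhs == nullptr; }
    };

    // Bottom-up reduction W_parent := W_lhs + W_rhs; leaf workloads are
    // assumed to be set already, e.g. to the number of cells in the cluster.
    std::size_t reduceWorkload(ClusterTreeNode &node) {
        if (!node.isLeaf())
            node.workload = reduceWorkload(*node.lhs) + reduceWorkload(*node.rhs);
        return node.workload;
    }

    // Top-down range annotation: R_lhs := R_parent, R_rhs := R_parent + W_lhs,
    // i.e. a left-to-right prefix sum over the leaf workloads.
    void annotateRanges(ClusterTreeNode &node, std::size_t range_start) {
        node.range_start = range_start;
        if (!node.isLeaf()) {
            annotateRanges(*node.lhs, range_start);
            annotateRanges(*node.rhs, range_start + node.lhs->workload);
        }
    }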

5.6.1 Scheduling strategies

We consider different scheduling strategies to assign the tasks on clusters to compute units. In this context, a task is any kind of operation to be executed on a cluster: a time step, a particular adaptivity traversal, backend output, setup, etc.

Owner compute and affinities

With $p$ compute units, let $W_{\mathrm{avg}} := W / p$ be the average workload per compute unit. Each cluster $i$ is then assigned to compute unit
\[
j := \left\lfloor \frac{R_i + \frac{W_i}{2}}{W_{\mathrm{avg}}} \right\rfloor .
\tag{5.1}
\]

To avoid recomputing this association for each traversal of the cluster tree, we cache the compute unit fulfilling Eq. (5.1) at each leaf node during the dynamic cluster-generation traversals. This yields a unique ownership of each cluster. Furthermore, we reduce this information bottom-up, extending the annotation of each node in the cluster tree by the range of compute units which own at least one leaf node below the currently processed node.
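
A minimal sketch of the ownership computation in Eq. (5.1), assuming integer workloads; ownerOf is an illustrative name, not one taken from the framework:

    #include <cstddef>

    // Eq. (5.1): a cluster is owned by the compute unit responsible for the
    // midpoint R_i + W_i/2 of its workload range, with W_avg := W / p.
    std::size_t ownerOf(std::size_t R_i, std::size_t W_i, double W_avg) {
        // The cast to an unsigned integer floors the non-negative quotient.
        return static_cast<std::size_t>((R_i + W_i / 2.0) / W_avg);
    }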

We implement two different scheduling strategies: either only those tree nodes are traversed for which Equation (5.1) holds, or task affinities are set for the tasks created on each node. We denote these strategies as owner-compute and affinities; both can be used to exploit the spatial locality of each cluster on a NUMA memory hierarchy.

Task flooding

Using the task-flooding scheduling, we create a task for each cluster tree node. This leads to operations (tasks) being executed via the work-stealing mechanisms of the shared-memory parallelization models used.
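
A possible realization of task flooding with TBB's task_group, reusing the illustrative ClusterTreeNode from the earlier sketch; processCluster stands for whichever operation (time step, adaptivity traversal, ...) is currently executed and is a placeholder:

    #include <tbb/task_group.h>

    void processCluster(ClusterTreeNode &node);  // placeholder for the actual operation

    // Emit one task per cluster-tree node; TBB's work-stealing scheduler
    // distributes the flooded tasks over the available threads.
    void floodTasks(ClusterTreeNode &node, tbb::task_group &tg) {
        if (node.isLeaf()) {
            tg.run([&node] { processCluster(node); });  // leaf: run the operation
        } else {
            tg.run([&node, &tg] { floodTasks(*node.lhs, tg); });  // spawn left subtree
            floodTasks(*node.rhs, tg);                            // descend right subtree
        }
    }

    // Usage:
    //   tbb::task_group tg;
    //   floodTasks(root, tg);
    //   tg.wait();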

5.6.2 Cluster generation strategies

We present two different strategies for the cluster generation in this section.

Load-balanced splitting

An owner-compute scheme belongs to the classical domain-decomposition approaches, since it directly aims at assigning computations statically to compute units. For perfect load balancing, equal workload should be assigned to each compute unit. When assigning clusters to compute units, the load balancing can be improved by splitting clusters whose cells would be assigned to more than one compute unit. Given $p$ compute units, the average workload per compute unit is $W_{\mathrm{avg}} := W / p$. This leads to splitting a cluster $i$ if

\[
\left\lfloor \frac{R_i}{W_{\mathrm{avg}}} \right\rfloor \neq
\left\lfloor \frac{R_i + W_i - 1}{W_{\mathrm{avg}}} \right\rfloor .
\]
This targets an improved balance of the work decomposition and considers only the pure workload, assuming that the overheads for initializing computations on each cluster are negligible.
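
Expressed in code, the split criterion could look as follows; this is a sketch under the same illustrative naming assumptions as above:

    #include <cmath>
    #include <cstddef>

    // Split cluster i if its workload range [R_i, R_i + W_i) crosses a
    // compute-unit boundary, i.e. if its first and last workload items map
    // to different compute units.
    bool crossesUnitBoundary(std::size_t R_i, std::size_t W_i, double W_avg) {
        return std::floor(R_i / W_avg) != std::floor((R_i + W_i - 1) / W_avg);
    }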
Massive splitting

An alternative parallelization method is given by massive splitting. We use the assumption that the overhead for starting computations on a cluster is relatively small compared to the computational workload in each cluster. This allows us to create more clusters than there are compute units available and still process them efficiently due to the small overhead. Contrary to a 1:1 scheduling of tree splits, which would yield high idle times due to work imbalances, we expect decreased idle times by creating far more clusters than there are compute units available (see Fig. 5.20).


Figure 5.20: Massive splitting: creating far more clusters than there are processors.

For massive splitting, we split a cluster if its workload exceeds a threshold $T_s$ and join two clusters if the workloads of both child clusters undershoot a join threshold $T_j$. The join criterion can therefore be evaluated cluster-locally; only the boolean agreement of both child clusters to the join operation has to be compared.

In this work, we use the join threshold $T_j := \lfloor T_s / 2 \rfloor$, with $T_s$ representing the split threshold. Selecting the join threshold in this way assures that a join of two clusters does not directly result in a split due to the joined cluster exceeding the split threshold.
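
The thresholds and the cluster-local decision could be sketched as follows; the concrete threshold value is only an example, not one prescribed by the thesis:

    #include <cstddef>

    const std::size_t T_s = 4096;     // split threshold (example value only)
    const std::size_t T_j = T_s / 2;  // join threshold T_j := floor(T_s / 2)

    // Split a cluster once its workload exceeds the split threshold.
    bool wantsSplit(std::size_t workload) { return workload > T_s; }

    // Join two sibling clusters only if both agree, i.e. both workloads
    // undershoot the join threshold; the decision stays cluster-local.
    bool wantsJoin(std::size_t workload_lhs, std::size_t workload_rhs) {
        return workload_lhs < T_j && workload_rhs < T_j;
    }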

5.6.3 Threading libraries

With the growing number of cores on shared-memory systems, a wide variety of threading libraries is available. Considering dynamically changing resources (see Part IV), support for such extensions to a threading model is also required.

To introduce the parallel execution of operations on each cluster (see Section 5.6), we consider two different threading tools which are representative of other thread-based parallelization models: OpenMP and Threading Building Blocks (TBB). Both libraries offer parallelism in different ways. With OpenMP, parallelization is expressed in the program code via pragma compiler directives, whereas the parallelization features of TBB are made available to the application developer via C++ interfaces, with additional convenience through C++11 language features. Since information on both parallelization models is available in other work and reference guides, we refer the interested reader to the corresponding websites and reference guides for an introduction and only highlight how these models are used in our development.

Both threading models provide the very basic parallelization features required for our cluster-based framework: parallel processing of for-loops as well as tasking. We implemented three different ways to execute operations on clusters in parallel.
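
To illustrate the difference in style, a minimal example of the same parallel for-loop in both models; the function names are placeholders:

    #include <tbb/parallel_for.h>

    // OpenMP: parallelism is expressed via pragma directives on plain loops.
    void scaleOpenMP(double *data, int n, double s) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            data[i] *= s;
    }

    // TBB: parallelism is expressed via C++ templates and (C++11) lambdas.
    void scaleTBB(double *data, int n, double s) {
        tbb::parallel_for(0, n, [=](int i) { data[i] *= s; });
    }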

(a)
An owner-compute implementation starts a task for each compute unit via the parallel execution of a for-loop over the range of all compute units. Each for-loop iteration, and thus each compute unit, executes operations only on the subset of clusters for which Eq. (5.1) holds; a sketch is given after this list.
(b)
Creating a task for each cluster-tree node. In the case of massive splitting, this floods the overall system with tasks and relies on work stealing with both considered threading libraries. During the traversal of our cluster tree, the implementation emits a new task for each cluster tree node.
(c)
We can further use features offered by TBB to create tasks with affinities to compute units. Such an attribute leads to enqueuing a task into the work queue of the thread for which the affinity was set.
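
For variant (a), a hedged sketch using an OpenMP for-loop over the compute units, reusing ownerOf and processCluster from the earlier sketches. It iterates over a flat cluster list for brevity, whereas the actual implementation traverses the cluster tree and skips subtrees using the cached ownership ranges:

    #include <cstddef>
    #include <vector>

    // Variant (a): one for-loop iteration per compute unit; each iteration
    // processes only the clusters it owns according to Eq. (5.1).
    void ownerComputeSweep(std::vector<ClusterTreeNode*> &clusters,
                           int num_units, double W_avg) {
        #pragma omp parallel for
        for (int unit = 0; unit < num_units; ++unit)
            for (ClusterTreeNode *c : clusters)
                if (ownerOf(c->range_start, c->workload, W_avg)
                        == static_cast<std::size_t>(unit))
                    processCluster(*c);
    }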