We summarize our major developments and contributions, followed by an outlook on future work:
On shared-memory systems, we evaluated TBB and OpenMP parallelizations on a 40-core NUMA system, obtaining high scalability for short- and long-term simulations with NUMA-aware scheduling. Here, the owner-compute scheme with threshold-based splitting yielded the best results.
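The threshold-based splitting mentioned above can be illustrated with a minimal sketch. All names (`Cluster`, `SPLIT_THRESHOLD`, `balance`) are illustrative assumptions, not the actual interface of our software: the idea is only that a cluster covering a contiguous SFC range is split in half whenever its cell count exceeds a threshold, so that the owning thread can hand one half to an idle thread.

```python
# Hedged sketch of threshold-based cluster splitting (assumed names and
# threshold value; the real implementation differs in detail).

SPLIT_THRESHOLD = 4096  # assumed tuning parameter: max cells per cluster


class Cluster:
    def __init__(self, first_cell, num_cells):
        self.first_cell = first_cell  # start index on the SFC
        self.num_cells = num_cells    # number of cells in this cluster

    def split(self):
        """Split into two clusters covering contiguous SFC sub-ranges."""
        half = self.num_cells // 2
        return [Cluster(self.first_cell, half),
                Cluster(self.first_cell + half, self.num_cells - half)]


def balance(clusters):
    """Repeatedly split clusters whose workload exceeds the threshold."""
    result, stack = [], list(clusters)
    while stack:
        c = stack.pop()
        if c.num_cells > SPLIT_THRESHOLD:
            stack.extend(c.split())
        else:
            result.append(c)
    return result
```

For example, a cluster of 10000 cells would be split twice, producing four clusters of 2500 cells each, all below the threshold.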
On distributed-memory systems, our clustering concept leads to efficient data migration. Our benchmarks show a weak scalability of over 80% on more than 8000 cores, with the baseline at 256 cores.
We envision the following possible further developments:
Regarding issue (a), we propose to generate an indexing structure that allows resolving all hanging nodes inside each cluster in a single traversal. This indexing structure can itself be generated in a single traversal by forwarding the current cell index with the edge communication. The result is a directed acyclic graph whose edges point towards the adjacent cells with a higher SFC-enumerated index; its root node is the last traversed cell and its sink is the first cell. Based on this directed graph, we can reconstruct the indexing of all cells, yielding a bidirected graph. This indexing makes the direct forwarding of hanging-node markers within each cluster possible, hence not requiring additional traversals.
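The single-traversal construction of this graph can be sketched as follows. The data layout and function names are assumptions for illustration: each cell, visited in SFC order, forwards its own index over every edge touched for the first time; when the second adjacent cell visits that edge, a directed edge from the lower-indexed to the higher-indexed cell is recorded. Reversing all edges afterwards yields the bidirected graph over which hanging-node markers can be forwarded.

```python
# Hedged sketch (assumed representation) of building the cell-adjacency DAG
# in a single SFC-ordered traversal via index forwarding on edges.

def build_dag(cells_edges):
    """cells_edges: list, in SFC order, of each cell's edge identifiers."""
    pending = {}     # edge id -> index of the first adjacent cell visited
    successors = {}  # cell index -> adjacent cells with a higher SFC index
    for idx, edges in enumerate(cells_edges):
        successors[idx] = []
        for e in edges:
            if e in pending:
                # second visit: record edge from lower to higher index
                successors[pending.pop(e)].append(idx)
            else:
                # first visit: forward the current cell index on this edge
                pending[e] = idx
    return successors


def bidirect(successors):
    """Add the reversed edges, reconstructing the indexing to all cells."""
    graph = {i: list(s) for i, s in successors.items()}
    for i, succs in successors.items():
        for j in succs:
            graph[j].append(i)
    return graph
```

For three cells sharing edges pairwise, e.g. `build_dag([['a', 'b'], ['a', 'c'], ['b', 'c']])`, all recorded edges point from the lower to the higher SFC index, and `bidirect` restores the reverse direction.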
Considering issue (b), the synchronization barriers can be circumvented by allowing hanging nodes on the cluster boundaries. Such hanging nodes can be represented by splitting an RLE entry, with appropriate handling of the split entries required.
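Splitting an RLE entry can be sketched as follows. The representation is an assumption for illustration: each entry `(neighbor, length)` describes a run of shared boundary edges, and refining the cell behind one such edge turns that single edge into two, so the run containing it is split around a new length-2 entry.

```python
# Hedged sketch (assumed representation) of splitting a run-length-encoded
# boundary entry to represent a hanging node on a cluster boundary.

def split_rle(rle, pos):
    """Split the boundary edge at global position `pos` into two edges."""
    out, offset = [], 0
    for neighbor, length in rle:
        if offset <= pos < offset + length:
            left = pos - offset        # edges in the run before the refined one
            right = length - left - 1  # edges in the run after it
            if left:
                out.append((neighbor, left))
            out.append((neighbor, 2))  # the refined edge now counts as two
            if right:
                out.append((neighbor, right))
        else:
            out.append((neighbor, length))
        offset += length
    return out
```

For instance, refining the edge at position 4 of the boundary `[(0, 3), (1, 4)]` leaves the first run intact and splits the second one around the new length-2 entry.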
With (a) and (b) solved, no global synchronization, e.g. reductions on the adaptivity conformity states, would be required.