We summarize our major developments and contributions, followed by an outlook on future work:
On shared-memory systems, we evaluated TBB and OpenMP parallelizations on a 40-core NUMA system, obtaining high scalability for short- and long-term simulations with NUMA-aware scheduling. Here, the owner-compute scheme with threshold-based splitting yielded the best results.
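The threshold-based splitting mentioned above can be illustrated with a minimal sketch. All names (`Cluster`, `SPLIT_THRESHOLD`, `balance`) are illustrative assumptions, not the actual interface of our software: the idea is only that a cluster covering a contiguous SFC range is split in half whenever its cell count exceeds a threshold, so that the owning thread can hand one half to an idle thread.

```python
# Hedged sketch of threshold-based cluster splitting (assumed names and
# threshold value; the real implementation differs in detail).

SPLIT_THRESHOLD = 4096  # assumed tuning parameter: max cells per cluster


class Cluster:
    def __init__(self, first_cell, num_cells):
        self.first_cell = first_cell  # start index on the SFC
        self.num_cells = num_cells    # number of cells in this cluster

    def split(self):
        """Split into two clusters covering contiguous SFC sub-ranges."""
        half = self.num_cells // 2
        return [Cluster(self.first_cell, half),
                Cluster(self.first_cell + half, self.num_cells - half)]


def balance(clusters):
    """Repeatedly split clusters whose workload exceeds the threshold."""
    result, stack = [], list(clusters)
    while stack:
        c = stack.pop()
        if c.num_cells > SPLIT_THRESHOLD:
            stack.extend(c.split())
        else:
            result.append(c)
    return result
```

For example, a cluster of 10000 cells would be split twice, producing four clusters of 2500 cells each, all below the threshold.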
On distributed-memory systems, our clustering concept leads to efficient data migration. Our benchmarks show a weak scalability of over 80% on more than 8000 cores, with the baseline at 256 cores.
We envision the following possible further developments:
Regarding issue (a), we propose to generate an indexing structure that allows resolving all hanging nodes inside each cluster in a single traversal. This indexing structure can itself be generated in a single traversal by forwarding the current cell index with the edge communication. The result is a directed acyclic graph whose edges point towards the adjacent cells with a higher SFC-enumerated index; its root node is the last traversed cell and its sink is the first cell. Based on this directed graph, we can reconstruct the indexing of all cells, yielding a bidirected graph. This indexing makes the direct forwarding of hanging-node markers within each cluster possible, hence not requiring additional traversals.
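The single-traversal construction of this graph can be sketched as follows. The data layout and function names are assumptions for illustration: each cell, visited in SFC order, forwards its own index over every edge touched for the first time; when the second adjacent cell visits that edge, a directed edge from the lower-indexed to the higher-indexed cell is recorded. Reversing all edges afterwards yields the bidirected graph over which hanging-node markers can be forwarded.

```python
# Hedged sketch (assumed representation) of building the cell-adjacency DAG
# in a single SFC-ordered traversal via index forwarding on edges.

def build_dag(cells_edges):
    """cells_edges: list, in SFC order, of each cell's edge identifiers."""
    pending = {}     # edge id -> index of the first adjacent cell visited
    successors = {}  # cell index -> adjacent cells with a higher SFC index
    for idx, edges in enumerate(cells_edges):
        successors[idx] = []
        for e in edges:
            if e in pending:
                # second visit: record edge from lower to higher index
                successors[pending.pop(e)].append(idx)
            else:
                # first visit: forward the current cell index on this edge
                pending[e] = idx
    return successors


def bidirect(successors):
    """Add the reversed edges, reconstructing the indexing to all cells."""
    graph = {i: list(s) for i, s in successors.items()}
    for i, succs in successors.items():
        for j in succs:
            graph[j].append(i)
    return graph
```

For three cells sharing edges pairwise, e.g. `build_dag([['a', 'b'], ['a', 'c'], ['b', 'c']])`, all recorded edges point from the lower to the higher SFC index, and `bidirect` restores the reverse direction.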
Considering issue (b), the synchronization barriers can be circumvented by allowing hanging nodes on the cluster boundaries. Such hanging nodes can be represented by splitting an RLE entry, with appropriate handling of the split entries required.
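Splitting an RLE entry can be sketched as follows. The representation is an assumption for illustration: each entry `(neighbor, length)` describes a run of shared boundary edges, and refining the cell behind one such edge turns that single edge into two, so the run containing it is split around a new length-2 entry.

```python
# Hedged sketch (assumed representation) of splitting a run-length-encoded
# boundary entry to represent a hanging node on a cluster boundary.

def split_rle(rle, pos):
    """Split the boundary edge at global position `pos` into two edges."""
    out, offset = [], 0
    for neighbor, length in rle:
        if offset <= pos < offset + length:
            left = pos - offset        # edges in the run before the refined one
            right = length - left - 1  # edges in the run after it
            if left:
                out.append((neighbor, left))
            out.append((neighbor, 2))  # the refined edge now counts as two
            if right:
                out.append((neighbor, right))
        else:
            out.append((neighbor, length))
        offset += length
    return out
```

For instance, refining the edge at position 4 of the boundary `[(0, 3), (1, 4)]` leaves the first run intact and splits the second one around the new length-2 entry.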
With (a) and (b) solved, no global synchronization, e.g. reductions on the adaptivity conformity states, would be required.