These html pages are based on the PhD thesis "Cluster-Based Parallelization of Simulations on Dynamically Adaptive Grids and Dynamic Resource Management" by Martin Schreiber.
There is also more information and a PDF version available.

pict
Technische Universität München
Lehrstuhl für Informatik mit Schwerpunkt Wissenschaftliches Rechnen

Cluster-Based Parallelization of Simulations on Dynamically Adaptive Grids and Dynamic Resource Management

Martin Schreiber
Vollständiger Abdruck der von der Fakultät für Informatik der Technischen Universität München zur Erlangung des Akademischen Grades eines
Doktors der Naturwissenschaften (Dr. rer. nat.)
genehmigten Dissertation. Vorsitzender:    Univ.-Prof. Dr. Florian Matthes
Prüfer der Dissertation:    Univ.-Prof. Dr. Hans-Joachim Bungartz
   Univ.-Prof. Christian Bischof, Ph. D.

Die Dissertation wurde am 30.01.2014 bei der Technischen Universität München eingereicht und durch die Fakultät für Informatik am 11.05.2014 angenommen.

Contents

I Introduction
Content overview
Content overview
II Essential numerics of hyperbolic PDEs
1 Continuity equation and applications
1.1 Continuity equation
1.2 Examples of hyperbolic systems
2 Discontinuous Galerkin discretization
2.1 Grid generation
2.2 Triangle reference and world space
2.3 Basis functions
2.4 Weak formulation
2.5 Mass matrix

2.6 Stiffness matrices

2.7 Flux matrices

2.8 Source term
2.9 Rotational invariancy and edge space
2.10 Numerical flux

2.11 Boundary conditions
2.12 Adaptive refining and coarsening matrices

and

2.13 CFL stability condition
2.14 Time stepping schemes
III Efficient framework for simulations on dynamically adaptive grids
3 Requirements and related work
3.1 Simulation: grid, data and communication management
3.2 HPC requirements
3.3 Space-filling curves
3.4 Related work
4 Serial implementation
4.1 Grid generation with refinement trees
4.2 Stacks
4.3 Stack-based communication
4.4 Classification of data lifetime
4.5 Stack- and stream-based simulation on a static grid
4.6 Adaptivity
4.7 Verification of stack-based edge communication
4.8 Higher-order time stepping: Runge-Kutta
4.9 Software design, programmability and realization
4.10 Optimization
4.11 Contributions
5 Parallelization
5.1 SFC-based parallelization methods for DAMR
5.2 Inter-partition communication and dynamic meta information
5.3 Parallelization with clusters
5.4 Base domain triangulation and initialization of meta information
5.5 Dynamic cluster generation
5.6 Shared-memory parallelization
5.7 Results: Shared-memory parallelization
5.8 Cluster-based optimization
5.9 Results: Long-term simulations and optimizations on shared-memory
5.10 Distributed-memory parallelization
5.11 Hybrid parallelization
5.12 Results: Distributed-memory parallelization
5.13 Summary and Outlook
6 Application scenarios
6.1 Prerequisites
6.2 Analytic benchmark: solitary wave on composite beach
6.3 Field benchmark: Tohoku Tsunami simulation
6.4 Output backends
6.5 Simulations on the sphere
6.6 Multi-layer simulations
6.7 Summary and outlook
IV Invasive Computing
7 Invasive Computing with invasive hard- and software
7.1 Inavsive hardware architecture
7.2 Invasive software architecture
7.3 Invasive algorithms
7.4 Results
8 Invasive Computing for shared-memory HPC systems
8.1 Invasion with OpenMP and TBB
8.2 Invasive client layer
8.3 Invasive resource manager
8.4 Scheduling decisions
8.5 Invasive programming patterns
8.6 Results
9 Conclusion and outlook
V Summary
A Appendix
A.1 Hyperbolic PDEs
A.2 Test platforms
Bibliography

PART I

INTRODUCTION

Computational fluid dynamics (CFD) play an important role in a variety of disciplines. They are essential in fields such as car manufacturing, weather forecasting and combustion-engine design where they are used to optimize the aerodynamic behavior of car designs, to predict the weather and to understand processes inside an engine, respectively.

In this work, we focus on CFD simulations that are based on models described by partial differential equations (PDE), in particular by hyperbolic PDEs such as the shallow water and Euler equations. These hyperbolic PDEs model wave-propagation dominated phenomena and cannot be solved analytically in general. Therefore, these equations have to be discretized, which typically leads to a grid on which the scenario is computed. However, these simulations have a high computational intensity and, hence, require high-performance computing (HPC) systems.

Implementing such simulations with a static and regularly refined grid perfectly fits to the HPC parallel architectures and well-known parallelization techniques are applicable. Despite achieving high scalability, the time-to-solution for wave-dominated CFD simulations on regularly resolved domains is non-optimal. Considering a simulation of a wave propagation, such regular grids result in a very high resolution in the entire simulation domain despite the fact that the wave is propagating only in parts of the domain. Hence, many computations are invested in grid areas without a large contribution to the numerical solution.

With grids refining and coarsening during run time, this seems to be an easy-to-accomplish approach if only considering a non-parallel implementation and using standard computer science algorithms. However, further requirements for an efficient grid traversal and parallelization on state-of-the-art HPC systems lead to additional challenges with the major ones briefly summarized here:

(a): Memory-bandwidth and -hierarchy awareness:
As memory access is considered to be one of the main bottlenecks for the next decades of computing architectures, the algorithms should aim at reducing the memory footprint, e.g. during grid traversals. Regarding the data exchange between grid cells, the grid traversal should focus on reaccessing data which is already stored on higher cache levels.
(b): Parallelization models for shared- and distributed-memory systems:
For current HPC architectures, a cache-coherency is available within each compute node, hence supporting shared-memory parallelization models. However, a cache-coherency of memory access among all compute nodes is not feasible and demands for distributed-memory parallelization models. Furthermore, for an efficient parallelization on accelerators such as XeonPhi, shared- as well as distributed-memory parallelization models are mandatory for efficiency reasons.
(c): Load-balancing with dynamically changing grids:
Since the grid is refined and coarsened over the simulation’s runtime, load imbalances are created which lead to idle times and hence reduced efficiency. To avoid such idle times on large-scale systems, a load-balancing scheme is required to compensate these idle times.
(d): Usability:
Considering all the aforementioned HPC aspects (a)-(c), it is in general challenging to keep a sufficient usability which hides the complexity of the traversals of a dynamically changing grid, the utilization of both parallelization models and the load-balancing approach from the application developer.

Former work on the execution of simulations with stack- and stream-based algorithms already provides a solution for the memory-awareness issues (a). However, a scalable and efficient parallelization (b) with both parallelization models (c) is still subject of research and with a new parallelization approach also its usability (d).

We first give a basic introduction to the discontinuous Galerkin discretization of wave-propagation-dominated models in Part II. Based on this knowledge, we present our framework and the cluster-based parallelization in Part III. Simulations on a dynamically changing grid lead to a changing workload, hence also profit from dynamically changing resources on which we focus on Part IV for concurrently executed simulations.

PART II

ESSENTIAL NUMERICS OF HYPERBOLIC PDES This part provides the fundamental basics behind simulations of hyperbolic partial differential equations (PDEs) and their numerical discretization. These aspects are relevant for taking appropriate decisions concerning the development of the framework discussed in Part III of this thesis. Also closing the gap between a theoretical mathematical formulation of a model and its concrete implementation can only be achieved by a deep knowledge of the requirements set up by the problem to be solved.

Chapter 1: Continuity equation and applications
We start with an introduction to the governing equation to provide a better understanding of the underlying system of equations. For a concrete applicability of our framework, we introduce the shallow water and the Euler equations which are numerically solved by the discretization and solvers presented in the next chapter.

Chapter 2: Discontinuous Galerkin discretization
The basic discretization for higher-order discontinuous Galerkin (DG) methods on triangular grids is presented, followed by refinement and coarsening projections for dynamic adaptivity.

PART III

EFFICIENT FRAMEWORK FOR SIMULATIONS ON DYNAMICALLY ADAPTIVE GRIDS

This part of the thesis introduces an efficient simulation of wave propagations based on a model with hyperbolic equations. With waves propagating over time, refining the grid in feature-rich areas and coarsening it otherwise is one major requirement. This requires efficient implementations of simulations on dynamically adaptive grids which represents one of the great challenges in nowadays HPC. Typical requirements are fast grid traversals, memory-hierarchy-aware data access and parallelization approaches for large-scale systems including fast load balancing to name just a few of them. Our software development approach aims at solving these issues and is presented in this part.

Chapter 3: Requirements and related work
Since our main goal is an efficient implementation of simulations on dynamically adaptive grids, we first specify requirements on the framework mainly driven by the underlying hardware and show the related work.

Chapter 4: Serial implementation
Based on this requirements analysis, spacetrees provide a solution. After a short introduction to these spacetrees, we provide a formal description of existing work on the grid-data management and communication system based on stacks and streams. This is followed by extensions, modifications and optimizations compared to previous developments.

Chapter 5: Parallelization
We then introduce the parallelization of the inherently serial simulation with our cluster-based domain decomposition. The resulting software design directly offers parallelization on shared- and distributed-memory systems with the results presented in the respective sections.

Chapter 6: Application scenarios
To show the applicability of our development, we extended the framework with state-of-the-art solvers for analytical convergence studies and a realistic Tsunami simulation. Furthermore, we present different output backends for the simulation data. This chapter closes with extensions for simulations on the sphere and multi-layers.

Chapter 6.7: Summary and outlook
Based on our new parallelization approach, we will give conclusions and an outlook for further extensions of the framework in this Chapter.

PART IV

INVASIVE COMPUTING

Current batch-job schedulers of super-computing centers rely on a static number of resources assigned to a program during its execution time. This number of resources is typically specified by the application developer at the time of enqueuing the application to the batch system.

Considering the overall system’s state with multiple applications executed in parallel, such a static resource allocation is incapable to account for runtime-changing resource requirements of applications. These changing resource requirements are e.g. induced by simulations with dynamically adaptive mesh refinement. Hence, there is a demand for dynamic resource allocation which leads to challenging issues such as coping with the dynamic resource allocation for concurrently executed applications on HPC systems.

With the Invasive Computing paradigm, a promising approach is suggested for a dynamic resource management. This paradigm was initially suggested for multiprocessor System-on-Chip (MPSoC), in particular tightly-coupled processor arrays [Tei08] focusing on the efficient utilization of a two-dimensional array of computing cores for prospective computing architectures with hundreds and thousands of cores. There, the authors introduce the idea of the ability of programs to copy and execute themselves on computing resources in their proximity. Three basic operations are suggested; here we give a brief description on the operations which are relevant for this work:

With invade, each computation resource can request additional resources. Desired properties of the requested resources can be specified by an additional constraints. Such constraints can be e.g. the number of cores, floating-point support, etc.
The next step is infect which replicates the program into the successfully invaded cores and starts the program.
Once the computations are finished, a retreat frees the previously invaded cores and makes them available for further invades.

Despite originally designed for MPSoCs, the Invasive Computing paradigm also yields the potential of being applied to our HPC computing issues.

Chapter 7: Invasive Computing with invasive hard- and software
We first give a brief introduction to our development on invasive extensions for algorithms developed within the Transregio research project “Invasive Computing”. Since this thesis focuses on Invasive Computing on HPC shared-memory architectures, we only show the multigrid algorithm as a representative application from the area of scientific computing to show the differences of Invasive Computing on MPSoCs to Invasive Computing on HPC shared-memory systems.

Chapter 8: Invasive Computing for shared-memory HPC systems
For Invasive Computing in HPC on shared-memory systems, a concrete realization of Invasive Computing is presented in this chapter. This is followed by results on the paradigm applied to simulations on dynamically adaptive grids, including Tsunami parameter studies.

PART V

SUMMARY

In this work, we focused on the efficient implementation of dynamically adaptive mesh refinement for optimizing the execution of numerical simulations. In particular, we considered wave-propagation dominated problems such as Tsunami simulations. We started with an introduction to the very basics of a discontinuous Galerkin (DG) discretization scheme which yielded our communication requirements with edge- and vertex-based communications for the considered DG solvers. For the discretization, we decided to use triangles as grid primitives and use a grid which is generated by the triangle-bisecting Sierpiński SFC.

Regarding the serial implementation, we first gave an introduction to existing work on stack- and stream-based simulations. Then, we introduced clear interfaces for the communication and data access to run DG simulations on dynamically adaptive grids. A code generator is developed which does not only lead to an efficient method to generate traversal code with interfaces tailored to the user requirements, but also yields optimizations with parameter unrolling, avoiding most of the if-branching instructions. Furthermore, we avoid obsolete access of memory e.g. by separation of structure, cell-data and adaptivity-state stacks. An automaton table considers the propagation direction of adaptivity information and is further used for optimizations. With a prospective stack reallocation, a good balance between memory requirements and frequent stack reallocation was presented.

For the parallelization, we first developed a run-length encoded meta communication information with its connectivity information implicitly updated by adaptivity markers written on the communication stacks. We use zero-length encodings for vertices which are not already represented by the edge-based meta information. Such a run-length encoded (RLE) meta communication information further allows an efficient block-wise communication. Then, we introduced clustering based on SFC-induced cell intervals and the aforementioned RLE communication meta information. Our cluster generation is based on tree-splits and -joins and we infer the new meta information after splits and joins implicitly via the elements stored on the stacks during a traversal. Our software and communication methods lead to a direct applicability of different parallelization models. We developed different cluster generation and scheduling methods on shared- and distributed-memory systems. On a 40-core shared-memory system, the owner-compute scheme yielded the best results due to the improved NUMA-domain awareness. Besides the RLE meta encoding, we developed further algorithmic optimizations with one of them the skipping of already conforming clusters. This results in a robust performance improvement, also compensating the additionally required conforming grid traversals required due to the domain decomposition. On distributed-memory systems the clustering leads to efficient block-wise communication as well as cluster-based data migration. With a dynamically adaptive grid in each time step, the strong scalability measured with a baseline at 256 cores is over 60% and the weak scalability is over 90% on 8192 cores.

Dynamically changing grids also lead to dynamically changing resource requirements. To cope with these changing resource requirements, we achieved a dynamic resource allocation by extending standard shared-memory HPC parallelization models with the Invasive Computing paradigm. We presented a resource manager on a shared-memory system and the required extensions to the simulation framework. This allows an optimization of resources based on application-specific information. We conducted several benchmarks with a dynamical redistribution of the computing resources among concurrently running applications on shared-memory systems, resulting in throughput improvements for concurrently executed Tsunami simulations of more than 43%.

The secret to creativity is knowing how to hide your sources.

Albert Einstein (1879 - 1955)