Technische Universität München
Lehrstuhl für Informatik mit Schwerpunkt Wissenschaftliches Rechnen
Cluster-Based Parallelization of
Simulations on Dynamically Adaptive Grids
and Dynamic Resource Management
Martin Schreiber
Full reprint of the dissertation approved by the Fakultät für Informatik of the Technische Universität
München to obtain the academic degree of
Doktor der Naturwissenschaften (Dr. rer. nat.)
Chair: Univ.-Prof. Dr. Florian Matthes
Examiners of the dissertation: Univ.-Prof. Dr. Hans-Joachim Bungartz
Univ.-Prof. Christian Bischof, Ph.D.
The dissertation was submitted to the Technische Universität München on 30.01.2014
and accepted by the Fakultät für Informatik on 11.05.2014.
Computational fluid dynamics (CFD) plays an important role in a variety of disciplines. It is essential in fields such as car manufacturing, weather forecasting and combustion-engine design, where it is used to optimize the aerodynamic behavior of car designs, to predict the weather and to understand the processes inside an engine, respectively.
In this work, we focus on CFD simulations based on models described by partial differential equations (PDEs), in particular hyperbolic PDEs such as the shallow water and Euler equations. These hyperbolic PDEs model wave-propagation-dominated phenomena and, in general, cannot be solved analytically. Therefore, the equations have to be discretized, which typically leads to a grid on which the scenario is computed. However, such simulations are computationally intensive and hence require high-performance computing (HPC) systems.
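As a concrete example of such a hyperbolic system, the one-dimensional shallow water equations can be written in conservation form (a standard textbook formulation, given here only for illustration, with water height $h$, velocity $u$ and gravitational constant $g$):

\[
\partial_t \begin{pmatrix} h \\ hu \end{pmatrix}
+ \partial_x \begin{pmatrix} hu \\ hu^2 + \tfrac{1}{2} g h^2 \end{pmatrix}
= \begin{pmatrix} 0 \\ 0 \end{pmatrix}
\]

The first component expresses conservation of mass, the second conservation of momentum; wave-like solutions of this system are what the discretizations in the following chapters resolve.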
Implementing such simulations with a static and regularly refined grid fits parallel HPC architectures perfectly, and well-known parallelization techniques are applicable. Yet, despite achieving high scalability, the time-to-solution for wave-dominated CFD simulations on regularly resolved domains is far from optimal: for a simulation of a propagating wave, a regular grid enforces a very high resolution in the entire simulation domain even though the wave propagates only in parts of it. Hence, many computations are invested in grid areas that contribute little to the numerical solution.
Grids that refine and coarsen during run time seem easy to accomplish as long as only a non-parallel implementation with standard computer-science algorithms is considered. However, the further requirements of efficient grid traversal and parallelization on state-of-the-art HPC systems lead to additional challenges, the major ones being (a) memory awareness, (b) a scalable and efficient parallelization, (c) support for both shared- and distributed-memory parallelization models, and (d) usability of the resulting approach.
Former work on the execution of simulations with stack- and stream-based algorithms already provides a solution for the memory-awareness issue (a). However, a scalable and efficient parallelization (b) with both parallelization models (c) is still a subject of research, as is the usability (d) of a new parallelization approach.
We first give a basic introduction to the discontinuous Galerkin discretization of wave-propagation-dominated models in Part II. Based on this knowledge, we present our framework and the cluster-based parallelization in Part III. Simulations on a dynamically changing grid lead to a changing workload and hence also profit from dynamically changing resources, on which we focus in Part IV for concurrently executed simulations.
Chapter 1: Continuity equation and applications
We start with an introduction to the governing equation to provide a better understanding of the
underlying system of equations. For concrete applications of our framework, we introduce the
shallow water and Euler equations, which are numerically solved with the discretization and solvers
presented in the next chapter.
Chapter 2: Discontinuous Galerkin discretization
The basic discretization for higher-order discontinuous Galerkin (DG) methods on triangular grids is
presented, followed by refinement and coarsening projections for dynamic adaptivity.
This part of the thesis introduces an efficient simulation of wave propagations based on a model with hyperbolic equations. Since waves propagate over time, refining the grid in feature-rich areas and coarsening it elsewhere is one major requirement. This calls for efficient implementations of simulations on dynamically adaptive grids, which represents one of the great challenges in today's HPC. Typical requirements are fast grid traversals, memory-hierarchy-aware data access, and parallelization approaches for large-scale systems including fast load balancing, to name just a few. Our software development approach aims at solving these issues and is presented in this part.
Chapter 3: Requirements and related work
Since our main goal is an efficient implementation of simulations on dynamically adaptive grids, we
first specify the requirements on the framework, mainly driven by the underlying hardware, and
discuss related work.
Chapter 4: Serial implementation
Based on this requirements analysis, spacetrees provide a solution. After a short introduction to
spacetrees, we give a formal description of existing work on the grid-data management and
communication system based on stacks and streams. This is followed by our extensions, modifications
and optimizations of these previous developments.
Chapter 5: Parallelization
We then introduce the parallelization of the inherently serial simulation with our cluster-based domain
decomposition. The resulting software design directly offers parallelization on shared- and
distributed-memory systems, with results presented in the respective sections.
Chapter 6: Application scenarios
To show the applicability of our development, we extended the framework with state-of-the-art solvers
for analytical convergence studies and a realistic Tsunami simulation. Furthermore, we present
different output backends for the simulation data. This chapter closes with extensions for simulations
on the sphere and with multiple layers.
Chapter 6.7: Summary and outlook
Based on our new parallelization approach, we draw conclusions and give an outlook on further
extensions of the framework.
Current batch-job schedulers of supercomputing centers rely on a static number of resources assigned to a program for its entire execution time. This number of resources is typically specified by the application developer when enqueuing the application to the batch system.
Considering the overall system state with multiple applications executed in parallel, such a static resource allocation cannot account for the runtime-changing resource requirements of applications, which are induced, e.g., by simulations with dynamically adaptive mesh refinement. Hence, there is a demand for dynamic resource allocation, which in turn leads to challenging issues such as coping with dynamic resource allocation for concurrently executed applications on HPC systems.
With the Invasive Computing paradigm, a promising approach for dynamic resource management has been suggested. This paradigm was initially proposed for multiprocessor systems-on-chip (MPSoC), in particular tightly coupled processor arrays [Tei08], focusing on the efficient utilization of a two-dimensional array of computing cores on prospective architectures with hundreds and thousands of cores. There, the authors introduce the ability of programs to copy and execute themselves on computing resources in their proximity. Three basic operations are suggested; here we briefly describe those relevant for this work: invade claims a set of computing resources, infect executes the program on the claimed resources, and retreat releases the claimed resources again.
Although originally designed for MPSoCs, the Invasive Computing paradigm also has the potential to be applied to our HPC challenges.
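The three basic operations of the paradigm can be illustrated with a minimal sketch of an invade/infect/retreat cycle in a shared-memory setting. All class and function names below are illustrative assumptions for this sketch, not the thesis's or the project's actual API.

```python
# Minimal sketch of the invade/infect/retreat pattern of Invasive
# Computing on a shared-memory machine. Names are illustrative only.

class ResourceManager:
    """Hands out cores from a shared pool to competing applications."""

    def __init__(self, num_cores):
        self.free_cores = set(range(num_cores))

    def invade(self, requested):
        """Claim up to `requested` cores; returns the granted set."""
        granted = set(list(self.free_cores)[:requested])
        self.free_cores -= granted
        return granted

    def retreat(self, cores):
        """Return previously claimed cores to the pool."""
        self.free_cores |= cores


def infect(cores, work_items):
    """Execute work on the claimed cores (here: a serial stand-in).

    A real implementation would pin threads to `cores`; this sketch
    only models the partitioning of work across them.
    """
    chunks = {c: [] for c in cores}
    ordered = sorted(cores)
    for i, item in enumerate(work_items):
        chunks[ordered[i % len(ordered)]].append(item)
    return chunks


manager = ResourceManager(num_cores=8)
claim = manager.invade(3)                # invade: request resources
parts = infect(claim, list(range(10)))   # infect: run work on the claim
manager.retreat(claim)                   # retreat: free resources again
```

The essential point is that the resource claim is negotiated with a central manager at run time instead of being fixed at job submission.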
Chapter 7: Invasive Computing with invasive hard- and software
We first give a brief introduction to our work on invasive extensions for algorithms developed
within the Transregio research project “Invasive Computing”. Since this thesis focuses on
Invasive Computing on HPC shared-memory architectures, we only present the multigrid
algorithm as a representative application from scientific computing, highlighting the
differences between Invasive Computing on MPSoCs and Invasive Computing on HPC shared-memory
systems.
Chapter 8: Invasive Computing for shared-memory HPC systems
For Invasive Computing in HPC on shared-memory systems, a concrete realization is presented in
this chapter. This is followed by results for the paradigm applied to
simulations on dynamically adaptive grids, including Tsunami parameter studies.
In this work, we focused on an efficient implementation of dynamically adaptive mesh refinement to optimize the execution of numerical simulations. In particular, we considered wave-propagation-dominated problems such as Tsunami simulations. We started with an introduction to the basics of a discontinuous Galerkin (DG) discretization scheme, which yielded our communication requirements with edge- and vertex-based communication for the considered DG solvers. For the discretization, we decided to use triangles as grid primitives and a grid generated by the triangle-bisecting Sierpiński space-filling curve (SFC).
Regarding the serial implementation, we first gave an introduction to existing work on stack- and stream-based simulations. Then, we introduced clear interfaces for the communication and data access needed to run DG simulations on dynamically adaptive grids. We developed a code generator which not only provides an efficient way to generate traversal code with interfaces tailored to the user's requirements, but also yields optimizations via parameter unrolling, avoiding most if-branching instructions. Furthermore, we avoid superfluous memory accesses, e.g. by separating the structure, cell-data and adaptivity-state stacks. An automaton table accounts for the propagation direction of adaptivity information and is further used for optimizations. With prospective stack reallocation, a good balance between memory requirements and frequent stack reallocations was achieved.
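The separation of the structure stream from the cell-data stream can be sketched as follows; the encoding (one bit per cell: 1 = refined into two children, 0 = leaf) and all names are illustrative assumptions, not the thesis's actual code.

```python
# Sketch of a stream-driven grid traversal: the grid structure is a bit
# stream read sequentially (1 = cell refined into two children, 0 = leaf),
# kept separate from the cell-data stream, in the spirit of the stack- and
# stream-based approach described above. Illustrative naming only.

def traverse(structure_bits, cell_data):
    """Depth-first traversal of a bisection tree encoded as a bit stream.

    Returns the leaf data in traversal (SFC) order.
    """
    structure = iter(structure_bits)
    data = iter(cell_data)
    leaves = []

    def visit():
        if next(structure):      # 1: refined cell, recurse into children
            visit()
            visit()
        else:                    # 0: leaf, consume one cell-data element
            leaves.append(next(data))

    visit()
    return leaves

# Root refined, its first child refined, second child a leaf:
# the stream 1 1 0 0 0 encodes three leaves.
print(traverse([1, 1, 0, 0, 0], ["a", "b", "c"]))  # ['a', 'b', 'c']
```

Because both streams are only ever read front-to-back, each is accessed with perfect spatial locality, which is the key to the memory awareness of the approach.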
For the parallelization, we first developed run-length encoded meta communication information whose connectivity is implicitly updated by adaptivity markers written to the communication stacks. We use zero-length encodings for vertices which are not already represented by the edge-based meta information. Such run-length encoded (RLE) meta communication information further allows efficient block-wise communication. We then introduced clustering based on SFC-induced cell intervals and the aforementioned RLE communication meta information. Our cluster generation is based on tree splits and joins, and we infer the new meta information after splits and joins implicitly via the elements stored on the stacks during a traversal. Our software and communication methods make different parallelization models directly applicable. We developed different cluster generation and scheduling methods on shared- and distributed-memory systems. On a 40-core shared-memory system, the owner-compute scheme yielded the best results due to its improved NUMA-domain awareness. Besides the RLE meta encoding, we developed further algorithmic optimizations, one of them the skipping of already conforming clusters. This results in a robust performance improvement which also compensates for the additional conforming-grid traversals required by the domain decomposition. On distributed-memory systems, the clustering leads to efficient block-wise communication as well as cluster-based data migration. With a dynamically adaptive grid in each time step, the strong scalability measured against a baseline at 256 cores is over 60% and the weak scalability is over 90% on 8192 cores.
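The core idea of the RLE meta communication information can be sketched in a few lines: instead of storing one entry per shared edge, the SFC-ordered cluster boundary is compressed into (neighbor, run-length) pairs. The function below is an illustrative sketch, not the thesis's data layout.

```python
# Sketch of run-length encoding (RLE) of communication meta information:
# the edges along a cluster's SFC-ordered boundary, labeled with the
# neighboring cluster that shares each edge, are compressed into
# (neighbor, run_length) pairs. Illustrative naming only.

def rle_encode(edge_neighbors):
    """Compress a per-edge neighbor list into (neighbor, run_length) pairs."""
    runs = []
    for neighbor in edge_neighbors:
        if runs and runs[-1][0] == neighbor:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([neighbor, 1])  # start a new run
    return [tuple(r) for r in runs]

# Boundary of one cluster: three edges shared with cluster 2, two with 5.
print(rle_encode([2, 2, 2, 5, 5]))  # [(2, 3), (5, 2)]
```

Since each run corresponds to a contiguous range on the communication stack, one message per run suffices for the block-wise exchange of all edge data shared with that neighbor.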
Dynamically changing grids also lead to dynamically changing resource requirements. To cope with them, we achieved dynamic resource allocation by extending standard shared-memory HPC parallelization models with the Invasive Computing paradigm. We presented a resource manager on a shared-memory system and the required extensions to the simulation framework. This enables an optimization of resources based on application-specific information. We conducted several benchmarks with a dynamic redistribution of computing resources among concurrently running applications on shared-memory systems, resulting in throughput improvements of more than 43% for concurrently executed Tsunami simulations.
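One application-specific policy of the kind such a resource manager enables is a workload-proportional redistribution of cores; the policy and all names below are an illustrative assumption for this sketch, not the thesis's implementation.

```python
# Sketch of workload-proportional core redistribution among concurrently
# running simulations. The policy is an illustrative assumption: each
# application gets at least one core, the rest in proportion to its
# current workload (e.g. its number of grid cells).

def redistribute(total_cores, workloads):
    """Return a core count per application summing to total_cores."""
    total = sum(workloads)
    shares = [max(1, round(total_cores * w / total)) for w in workloads]
    # Fix rounding drift so exactly total_cores are assigned.
    while sum(shares) > total_cores:
        shares[shares.index(max(shares))] -= 1
    while sum(shares) < total_cores:
        shares[shares.index(min(shares))] += 1
    return shares

# Three simulations with different cell counts on a 40-core machine:
print(redistribute(40, [100_000, 25_000, 75_000]))  # [20, 5, 15]
```

Re-evaluating such a policy whenever an application reports a changed workload is what turns the static batch allocation into a dynamic one.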