These HTML pages are based on the
PhD thesis "Cluster-Based Parallelization of Simulations on Dynamically Adaptive Grids and Dynamic Resource Management" by Martin Schreiber.
More information and a PDF version are also available.
3.2 HPC requirements
We highlight several mandatory aspects in the context of next-generation HPC systems. These
systems demand consideration of memory hierarchies, cache-oblivious behavior, and data
locality. For memory access optimizations, we focus on the following three aspects:
(a) Size of accessed memory: With the memory wall assumed to be one of the main bottlenecks
in the future [Xie13], memory transfers should be reduced to a minimum.
(b) Data migration: For load balancing, efficient data migration also has to be provided.
Such data migration should use asynchronous communication, reduce the amount of
migrated data, and provide an efficient method to update the adjacency information.
(c) Energy efficiency: With memory accesses on next-generation architectures expected to
consume increasingly more energy than the computations themselves [BC11], reducing
memory accesses is expected to lead directly to energy-optimized algorithms.
Next, we discuss parallelization on thread and instruction level. Due to physical constraints,
current hardware generations cannot keep scaling with Moore's law by increasing the clock
frequency alone [BC11]. To reduce the computation time of simulations with a given discretization,
the current way out of this frequency limitation is parallelization on thread and instruction level.
Therefore, the algorithm design further requires the capability to run efficiently on highly parallel
systems with dozens and even hundreds of cores. With a dynamically changing grid, this is considered
challenging due to the steadily changing workload; in our case, the workload changes after each
time step. Two parallelization methods from Flynn's taxonomy [Fly66] are considered
nowadays:
(a) MIMD (multiple instructions, multiple data): For thread-level parallelization, future
HPC systems provide a mix of cache-coherency properties that has to be considered in the
framework design: on shared-memory systems, cache-coherent memory is typically available, whereas
non-cache-coherent memory is provided across distributed-memory systems. Considering
accelerator cards such as the Xeon Phi, a hybrid parallelization is becoming mandatory.
(b) SIMD (single instruction, multiple data): On the instruction level, today's HPC
architectures demand that data is stored and processed in vector format. This
allows efficient data access and processing with vector operations, which execute the
same operation on multiple vector elements in parallel. On the current Xeon Phi
architecture, for example, one vector register holds 16 single-precision numbers. Using
such operations is mandatory for getting close to the peak flop rate of a chip and thus
should also be considered in the software development.