These html pages are based on the PhD thesis "Cluster-Based Parallelization of Simulations on Dynamically Adaptive Grids and Dynamic Resource Management" by Martin Schreiber.
There is also more information and a PDF version available.

3.2 HPC requirements

We highlight several mandatory aspects in the context of next-generation HPC systems. These systems demand consideration of memory hierarchies, cache-oblivious behavior, and data locality. For the memory access optimizations, we focus on the following three aspects:

Size of accessed memory: With the memory wall assumed to be one of the main bottlenecks in the future [Xie13], the memory transfers should be reduced to a minimum.
Data migration: For load balancing, efficient data migration also has to be provided. Such a migration should use asynchronous communication, keep the amount of migrated data low, and provide an efficient method to update the adjacency information.
Energy efficiency: With memory accesses on next-generation architectures expected to consume an increasing share of energy compared to computations [BC11], a reduction of memory accesses is expected to lead directly to energy-optimized algorithms.
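The first aspect can be illustrated with a minimal sketch that estimates how many bytes a traversal streams from memory when only one quantity per cell is needed. The `Cell` record, its four generic fields, and both helper functions are hypothetical and not taken from the thesis. The estimate assumes that the hardware fetches whole cache lines, so that reading one field of every densely packed record effectively transfers the full record.

```cpp
#include <cstddef>

// Hypothetical per-cell record stored as an array of structures (AoS).
// The four fields stand for arbitrary per-cell quantities.
struct Cell {
    float q0;
    float q1;
    float q2;
    float q3;
};

// The estimate below relies on the record being densely packed.
static_assert(sizeof(Cell) == 4 * sizeof(float), "Cell is expected to have no padding");

// AoS: the fields of one cell share cache lines, so reading a single
// field of all n cells still streams every record through the caches.
constexpr std::size_t bytes_streamed_aos(std::size_t n) {
    return n * sizeof(Cell);
}

// Structure of arrays (SoA): each field lives in its own contiguous
// array, so only the requested field is transferred.
constexpr std::size_t bytes_streamed_soa(std::size_t n) {
    return n * sizeof(float);
}
```

Under these assumptions, a traversal that needs only one of the four fields transfers a quarter of the data in the SoA layout, for example `bytes_streamed_soa(1000)` compared to `bytes_streamed_aos(1000)`.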

Next, we discuss the parallelization on thread and instruction level. Due to physical constraints, current hardware generations are not able to scale with Moore’s law by increasing the clock frequency alone [BC11]. To reduce the computation time for simulations with a given discretization, the current way around this frequency limitation is parallelization on thread and instruction level. Therefore, the algorithm design further requires the capability to run efficiently on highly parallel systems with dozens or even hundreds of cores. With a dynamically changing grid, this is considered challenging because the workload changes steadily, in our case after each time step. Two different parallelization methods regarding Flynn’s taxonomy [Fly66] are considered nowadays:

MIMD (multiple instructions, multiple data): For thread-level parallelization, future parallel HPC systems provide a mix of cache-coherent and non-cache-coherent memory, both of which are considered in the framework design: on shared-memory systems, cache-coherent memory is typically available, whereas non-cache-coherent memory is provided across distributed-memory systems. Considering accelerator cards such as the Xeon Phi, a hybrid parallelization is becoming mandatory.
SIMD (single instruction, multiple data): For instruction-level parallelism, today’s HPC architectures demand data to be stored and processed in vector format. This allows efficient data access and processing with vector operations, each of which applies one operation to all elements of a vector in parallel. For example, on the current Xeon Phi architecture, one vector can store 16 single-precision numbers. Using such operations is mandatory for getting close to the maximum flop rate of a chip and should therefore also be considered in the software development.
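The SIMD requirement can be sketched as follows, under stated assumptions. The 512-bit register width is inferred from the 16 single-precision numbers per vector mentioned above (16 × 32 bit). The helper names `simd_lanes` and `axpy` are hypothetical and only illustrate a loop shape that compilers can typically vectorize; they are not part of the framework described in the thesis.

```cpp
#include <cstddef>
#include <vector>

// Number of elements that fit into one vector register.
constexpr std::size_t simd_lanes(std::size_t register_bits, std::size_t element_bits) {
    return register_bits / element_bits;
}

// Computes y[i] += a * x[i] on contiguous arrays. Unit-stride access and
// independent iterations give the compiler the chance to map several
// iterations onto one vector instruction; whether it does so depends on
// the compiler and its optimization flags.
inline void axpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    const std::size_t n = x.size() < y.size() ? x.size() : y.size();
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

With these assumptions, `simd_lanes(512, 32)` evaluates to 16, so a fully vectorized `axpy` loop could process up to 16 single-precision elements per vector instruction.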