5.11 Hybrid parallelization
The number of cores per cache-coherent memory domain has increased considerably during the last
decade. Shared-memory systems with several threads per CPU are nowadays omnipresent, and with
Intel's Xeon Phi, even more than 100 threads have to be programmed efficiently in a shared-memory
environment. Combining distributed-memory parallelization via MPI with such a shared-memory
parallelization within each rank (hybrid parallelization) yields several advantages; some of them
are:
- Sampling of datasets:
Storing either the entire bathymetry data for Tsunami simulations or only a part of it in each
single-threaded MPI rank can lead to severe memory consumption. As a concrete example, we
consider the ocean bathymetry dataset from the General Bathymetric Chart of the Oceans
(GEBCO) [IOC], whose entire size is slightly less than 2GB. This already reaches the amount of
memory typically available per core: on the current generation of the SuperMUC, e.g., 16 cores
share 32GB of memory [EHB+13], so storing the entire GEBCO dataset with 2GB of memory
consumption per single-threaded rank would occupy all available memory.
With a hybrid parallelization, the dataset can be shared directly among several threads.
This allows storing the dataset only once in each program context, leaving more memory
available for simulation data (see the sketch after this list). We used this hybrid
parallelization for the Tsunami benchmarks in Section 6.3, with the entire bathymetry data
loaded into each rank's memory.
- Reduced data migration:
Using single-threaded MPI can result in severe communication overheads if several clusters
are migrated at the same time. Since the migrated stacks and streams are stored compactly in
memory and are transferred block-wise, this can lead to memory transfers similar to a
streaming benchmark. With a hybrid parallelization, some of this data migration can be
avoided: if a cluster has to be migrated to a thread (which corresponds to a rank in a
single-threaded MPI implementation) that executes its tasks in the same memory space in
which the cluster is already stored, the cluster can be processed directly by that thread
without any cluster migration.
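To illustrate the dataset sharing mentioned in the first point, the following minimal sketch shows a hybrid MPI+OpenMP setup in which one copy of a bathymetry raster is loaded per rank and then read concurrently by all threads of that rank. The names BathymetryData, load_bathymetry and sample_bathymetry are placeholders assumed for this sketch and are not taken from our implementation.

```cpp
#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <vector>

// Placeholder container for a bathymetry raster such as GEBCO;
// a real implementation would also store the grid origin and spacing.
struct BathymetryData {
    int nx = 0, ny = 0;
    std::vector<float> samples;   // read-only after loading
};

// Hypothetical loader stub: in practice this reads the dataset from
// disk (e.g. a netCDF file) exactly once per MPI rank.
static BathymetryData load_bathymetry()
{
    BathymetryData b;
    b.nx = 1024;
    b.ny = 1024;
    b.samples.assign(static_cast<size_t>(b.nx) * b.ny, -4000.0f);
    return b;
}

// Hypothetical nearest-neighbor sampling of the shared raster,
// with x and y normalized to [0, 1].
static float sample_bathymetry(const BathymetryData& b, double x, double y)
{
    const int i = static_cast<int>(x * (b.nx - 1));
    const int j = static_cast<int>(y * (b.ny - 1));
    return b.samples[static_cast<size_t>(j) * b.nx + i];
}

int main(int argc, char** argv)
{
    int provided = 0;
    // Funneled threading level: only the master thread issues MPI calls.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // One copy of the dataset per rank, shared by all threads of the rank.
    const BathymetryData bathymetry = load_bathymetry();

    double sum = 0.0;
    // All threads sample the same in-memory dataset concurrently;
    // no per-thread copy is required since the accesses are read-only.
    #pragma omp parallel for reduction(+ : sum)
    for (int j = 0; j < bathymetry.ny; ++j)
        for (int i = 0; i < bathymetry.nx; ++i)
            sum += sample_bathymetry(bathymetry,
                                     i / double(bathymetry.nx - 1),
                                     j / double(bathymetry.ny - 1));

    if (rank == 0)
        std::printf("%d threads per rank, mean depth %f m\n",
                    omp_get_max_threads(),
                    sum / (bathymetry.nx * bathymetry.ny));

    MPI_Finalize();
    return 0;
}
```

With a single-threaded MPI parallelization, the same node would instead have to hold one copy of the raster per rank, i.e. per core, which is exactly the memory problem described above.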
In the following, we discuss two alternative approaches to the inter-cluster communication scheme presented in Section 5.10.1.
- The first approach can be used to overcome the sequentialization of the iteration over
the clusters in reversed order to receive the data on the shared interfaces. The send/recv
tag can be extended with a unique communication tag associating the two communicating
clusters, e.g. derived from the clusters' unique IDs. Such a unique communication tag also
assures a unique message tag and, thus, no particular order has to be considered to read the
messages. However, the MPI standard 3.0 [For12] only assures a tag range from 0 to 32767.
This range can be exceeded by our cluster-based approach, which is based on tree-splits:
First, we require at least one bit to distinguish between the left and the right
communication stack, which restricts the remaining tag range to ≈ 16383. Second, a massive
splitting can lead to far more clusters than the available tag range, possibly violating the
requirements given by the MPI implementation (see the first sketch at the end of this section).
- The second considered alternative is based on the thread ids instead of the cluster ids.
Since the number of threads is limited, staying within a valid tag range can be assured. A
set of clusters can then be deterministically assigned to each thread (e.g. by using the
affinity ids), and each thread processes its set of clusters in parallel. This requires an
extension of the RLE communication meta information by also adding a thread id next to the
MPI rank (see [Mav02] for a similar concept and the second sketch at the end of this section).
Since our results already yield sufficient efficiency of the hybrid parallelization for simulating
Tsunamis on distributed-memory systems, we did not implement these alternatives.
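Although we did not implement the two alternatives, the first one can be made more concrete with the following minimal sketch. It assumes a hypothetical helper encode_cluster_tag that packs the sending cluster's unique ID and one bit for the left/right communication stack into an MPI tag, and it queries MPI_TAG_UB to check the tag range actually guaranteed by the MPI implementation; the encoding and all names are illustrative only, not part of our framework.

```cpp
#include <mpi.h>
#include <cstdint>
#include <cstdio>

// Hypothetical tag encoding: one bit selects the left/right communication
// stack, the remaining bits carry the unique ID of the sending cluster.
// This matches the range estimate above: within the guaranteed tag range
// of 0..32767, only about 16383 cluster IDs can be encoded.
static int encode_cluster_tag(std::uint32_t cluster_id, bool right_stack)
{
    return static_cast<int>((cluster_id << 1) | (right_stack ? 1u : 0u));
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    // MPI only guarantees that tags up to MPI_TAG_UB (at least 32767)
    // are accepted by the implementation.
    int* tag_ub_ptr = nullptr;
    int flag = 0;
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub_ptr, &flag);
    const int max_tag = (flag && tag_ub_ptr) ? *tag_ub_ptr : 32767;

    // A massive tree-split can create far more clusters than fit into the
    // guaranteed tag range; this loop reports the first cluster ID whose
    // derived tag would violate the limit of the present MPI library.
    for (std::uint32_t id = 0; id < 100000; ++id) {
        const int tag = encode_cluster_tag(id, /*right_stack=*/true);
        if (tag > max_tag) {
            std::printf("cluster %u needs tag %d > MPI_TAG_UB (%d)\n",
                        id, tag, max_tag);
            break;
        }
    }

    MPI_Finalize();
    return 0;
}
```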
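Similarly, the second alternative amounts to extending the RLE communication meta information with a thread id and deriving the message tag from the communicating thread ids, which are bounded by the (small) thread count. The struct layout and the names RleCommEntry and thread_pair_tag below are illustrative assumptions, not the data structures of our framework.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative run-length-encoded (RLE) entry for the shared-interface
// communication of Section 5.10.1: a run of edges shared with a remote
// cluster. The hypothetical extension for hybrid runs stores the remote
// thread id next to the MPI rank, so that message tags can be derived
// from thread ids (bounded by the thread count) instead of cluster ids.
struct RleCommEntry {
    int           remote_rank;     // MPI rank owning the adjacent cluster
    int           remote_thread;   // extension: thread id within that rank
    std::uint32_t remote_cluster;  // unique id of the adjacent cluster
    std::uint32_t edge_count;      // number of shared edges in this run
};

// Tag derived from the communicating thread ids only: since the number of
// threads per rank is small and fixed, the resulting tags always stay
// within the guaranteed MPI tag range.
static int thread_pair_tag(int src_thread, int dst_thread,
                           int threads_per_rank, bool right_stack)
{
    return ((src_thread * threads_per_rank + dst_thread) << 1)
           | (right_stack ? 1 : 0);
}

int main()
{
    // Example: 16 threads per rank, local thread 3 communicates with
    // thread 7 on rank 2 via the right communication stack.
    const RleCommEntry entry = {/*remote_rank=*/2, /*remote_thread=*/7,
                                /*remote_cluster=*/4711, /*edge_count=*/32};
    const int tag = thread_pair_tag(3, entry.remote_thread, 16,
                                    /*right_stack=*/true);
    std::printf("rank %d, thread %d, %u shared edges -> tag %d\n",
                entry.remote_rank, entry.remote_thread,
                entry.edge_count, tag);
    return 0;
}
```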