These html pages are based on the
PhD thesis "Cluster-Based Parallelization of Simulations on Dynamically Adaptive Grids and Dynamic Resource Management" by Martin Schreiber.
8.2 Invasive client layer
The Invasive paradigm was originally developed for specialized Invasive hardware, whereas we apply this paradigm to HPC shared-memory systems. This leads to differences in the programming interfaces, which are discussed next.
- Joined invade and infect call:
We can merge the invade and infect calls into a single invade. On shared-memory systems, infecting a compute resource does not require replicating the kernel code, since the code is already accessible in each thread's memory. As soon as the Invasive Computing paradigm is extended to HPC with distributed-memory systems, the infect call becomes necessary again, e.g. to synchronize simulation data such as the number of time steps.
- No claims:
For our target application, the main simulation loop leads to invasions that only require a single claim, which represents the currently invaded resources. We therefore assume that exactly one claim exists per application and that invasions only modify this claim.
This results in a simplified command space without infect calls and without explicit claims for Invasive Computing with our iteration-based shared-memory application. We next discuss the constraint system required for our simulations on dynamically changing grids (Sec. 8.2.1), the communication with the resource manager (Sec. 8.2.2) and the Invasive Computing API for shared-memory systems (Sec. 8.2.3).
8.2.1 Constraints
To schedule the compute resources for which the applications compete, the applications have to provide information on their current state. Such a constraint specification can lead to very complex structures if general applicability on heterogeneous systems is required. E.g. an application can either require three cores and one accelerator or, in case the accelerator is not invadable, more than three cores. This can be expressed with a tree structure, resulting in a constraint hierarchy (see [BRS+13]). Different constraint constellations can then be combined with AND or OR relations.
On the one hand, such a constraint hierarchy yields high flexibility; on the other hand, forwarding this information to the resource manager can lead to expensive communication overheads to and from the resource manager: the constraint tree has to be serialized, the serialized version sent to the RM, received there and unpacked by deserialization. Additional complexity arises inside the RM, which has to evaluate the constraint trees of all applications to optimize the resource assignment of the concurrently scheduled applications.
Since the HPC shared-memory systems considered in this work are based on homogeneous cores only, we decided to avoid such a constraint hierarchy and instead use a fixed number of non-hierarchical constraint properties in a list. Our constraint system is then based on the following properties:
- min/max cores: This constraint limits the number of claimed resources to the specified range.
- scalability graph: The scalability graph is specified as a one-dimensional array, with each entry representing the scalability for the corresponding number of cores.
- workload: This hint can be used to assign more cores to applications with a larger workload and vice versa.
Once specified by the application, these constraints have to be communicated to the resource manager; the details of resource scheduling based on scalability graphs and distribution hints are further described in Section 8.4.
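The three constraint properties above can be gathered in one flat record. The following C sketch is illustrative only; the type and field names (invasive_constraints_t, MAX_CORES, etc.) are assumptions, not the thesis' actual implementation:

```c
#define MAX_CORES 64

/* Flat, non-hierarchical constraint list as described above:
 * a core range, a scalability graph and a workload hint. */
typedef struct {
    int    min_cores;               /* lower bound on claimed cores */
    int    max_cores;               /* upper bound on claimed cores */
    float  scalability[MAX_CORES];  /* scalability[i]: scalability with i+1 cores */
    double workload;                /* hint: relative amount of work */
} invasive_constraints_t;
```

Since the number of properties is fixed, such a record can be sent to the resource manager as-is, avoiding the serialization and deserialization of a constraint tree.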
8.2.2 Communication to resource manager
A centralized resource manager is used, which is responsible for optimizing the resources based on the constraints received from the applications. Our client-server design (see Fig. 8.1) suggests message-based communication with the clients transferring constraints, typically of small message size, to the server running as a dedicated process. The server (resource manager) then has to receive the message, process it and send one or more messages back to the client (application). To overcome message-processing latencies, the ability to communicate asynchronously is also required.
All these requirements are fulfilled by System V IPC message queues (MQs), which support small message sizes that can be efficiently exchanged with the server. MQs also support testing whether a message can be dequeued, enabling non-blocking communication.
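The dequeue test can be sketched with msgrcv and the IPC_NOWAIT flag, which fails with ENOMSG when no message is pending; the message layout and helper name below are illustrative assumptions:

```c
#include <errno.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct mq_msg {
    long mtype;      /* receiver token */
    char text[64];   /* payload */
};

/* Test whether a message addressed to 'token' can be dequeued.
 * Returns 1 if a message was dequeued into *out, 0 if none is
 * pending, -1 on error. */
static int try_dequeue(int qid, long token, struct mq_msg *out)
{
    if (msgrcv(qid, out, sizeof out->text, token, IPC_NOWAIT) >= 0)
        return 1;
    return (errno == ENOMSG) ? 0 : -1;
}
```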
During its setup, the centralized resource manager creates a message queue with a unique identifier. This makes the message queue singleton-like for all invasive applications, which use the same identifier to communicate with the RM. Both the client and the server applications can then enqueue and dequeue messages using this message-queue identifier. Each message further has to be marked with a token specifying the receiver. We mark the RM with identifier 0 and each invasive application with the application's system-wide unique process id.
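The receiver token maps naturally to the System V message type, by which msgrcv selects messages. One caveat: System V requires mtype > 0, so a real implementation has to offset the RM's identifier 0; the helper names below are assumptions for illustration:

```c
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct rm_msg {
    long mtype;      /* receiver token: RM or an application's pid */
    char text[64];   /* payload, e.g. the constraint record */
};

/* Enqueue 'text' for the receiver identified by 'token' on queue 'qid'. */
static int rm_send(int qid, long token, const char *text)
{
    struct rm_msg m = { .mtype = token };
    strncpy(m.text, text, sizeof m.text - 1);
    return msgsnd(qid, &m, sizeof m.text, 0);
}

/* Dequeue the next message addressed to 'token' into 'buf' (64 bytes). */
static int rm_recv(int qid, long token, char *buf)
{
    struct rm_msg m;
    if (msgrcv(qid, &m, sizeof m.text, token, 0) < 0)
        return -1;
    memcpy(buf, m.text, sizeof m.text);
    return 0;
}
```

A real resource manager would create the queue with a well-known key (e.g. via ftok) so that all applications obtain the same queue identifier.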
8.2.3 Invasive Computing API
The following invasive interfaces are then offered to the application developer:
- setup()
This has to be executed directly after the program starts. It registers the application with the resource manager, which only sends back a message once at least a single core has been allocated to the application. Here, we aim at avoiding initial resource conflicts.
- shutdown()
This sends a signal to the RM to release all resources associated with the application. The application is then expected to exit immediately.
- invade_blocking(constraints)
We refer to the initially suggested invade call as a blocking invade, since we also introduce non-blocking invasions. To highlight the differences, we first describe the message processing of the blocking invade call:
(a) A message including the constraints is sent to the RM and the program loses the control flow.
(b) The RM then optimizes the resource utilization based on all available constraints of each invasive application. This is further discussed in Sections 8.3 and 8.4.
(c) After evaluations and optimizations inside the RM, possible changes of resources are sent to the application and the corresponding resources are marked as used by the application. Such a resource-update message then includes the core ids, which are infected by setting appropriate affinities (see Sec. 8.1).
(d) Finally, the control flow is given back to the program.
- invade_nonblocking(constraints)
As previously described, a blocking invasion involves latencies while waiting for feedback from the RM; all threads of the application idle until the RM's response arrives. With non-blocking invasion, a message including the constraints which describe the changing requirements is sent to the RM, and the control flow is given back to the application directly after the message has been enqueued, without waiting for a reply with the invaded resources. The RM then sends resource-update messages to the application without a preceding invade, and the application processes these messages similarly to the blocking invade by changing the number of cores and setting the pinning appropriately. Thus, we avoid idling cores.
- reinvade_blocking() and reinvade_nonblocking()
With the standard invade calls, constraints are always specified and forwarded to the RM. This leads to overheads of, first, serializing the constraint data and, second, evaluating and possibly optimizing the resource distribution on the RM side. If the resource requirements do not change, we avoid this serialization and optimization procedure with the reinvade interfaces: the RM assumes that the previously forwarded constraints should be used for optimization. Note that even if the resource requirements of application A did not change, a change in the resource requirements of application B can result in a change of the resources assigned to application A.
- retreat()
To hand back the currently used resources, a retreat can be executed by the application. All resources except a single one are then released, with the remaining thread continuing execution as the application's master thread.
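Putting the interfaces together, a typical iteration-based invasive application might follow the call sequence sketched below. The API bodies here are stand-in stubs that merely simulate the RM's replies so the sketch is self-contained; only the call sequence reflects the interfaces described above (shutdown is renamed shutdown_app to avoid clashing with the POSIX socket call):

```c
typedef struct { int min_cores, max_cores; double workload; } constraints_t;

static int num_cores = 1;  /* cores currently claimed (stub state) */

/* Stubs standing in for the invasive interfaces described above. */
static void setup(void)                      { num_cores = 1; }
static void invade_blocking(constraints_t c) { num_cores = c.max_cores; } /* stub: RM grants max */
static void reinvade_nonblocking(void)       { /* constraints unchanged */ }
static void retreat(void)                    { num_cores = 1; }
static void shutdown_app(void)               { /* release all resources */ }

/* Typical call sequence of an iteration-based invasive application. */
static int run_simulation(int timesteps)
{
    setup();                          /* register with the RM */
    constraints_t c = { 1, 4, 100.0 };
    invade_blocking(c);               /* initial, blocking invasion */
    for (int t = 0; t < timesteps; t++) {
        /* ...compute one time step on num_cores cores... */
        reinvade_nonblocking();       /* cheap per-iteration renegotiation */
    }
    retreat();                        /* keep only the master thread */
    shutdown_app();
    return num_cores;
}
```

The non-blocking reinvade inside the loop reflects the design goal above: per-iteration renegotiation without idling cores while waiting for the RM.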