Scheduling decisions

These html pages are based on the PhD thesis "Cluster-Based Parallelization of Simulations on Dynamically Adaptive Grids and Dynamic Resource Management" by Martin Schreiber.
There is also more information and a PDF version available.

[next] [prev] [prev-tail] [tail] [up]

8.4 Scheduling decisions

The content and structure of this section is related to our work [SRNB13b] which is currently under review. The optimized target resource distribution is computed based on the previously introduced data structures as the specified optimization target and as an per-application specified information.

We further drop the core dependencies of our original optimization function (8.2), yielding the simplified optimization function

⃗(i+1) ⃗ (i) ⃗ ⃗ D := foptimize(D ,P,T ).

(8.3)

Optimizations are then applied with the constraints of all applications depending on and the per-application optimization target .

Requirements on constraints: The constraints which are forwarded by resource-aware applications to the RM are then kept in with one entry for each application. Then, the RM schedules the available resources based on the optimization target and these constraints. Here, we distinguish between local (optimizing resources for a single application) and global constraints (optimizing resources for multiple applications).

Local constraints: With constraints given by the range of cores between 1 and the maximum number of cores, an application can request a particular range of cores. These constraints make is challenging to optimize concurrently running applications since no knowledge on their performance state for a changing number of resources can be inferred and we refer to such constraints as local ones.

Global constraints: Such constraints can be evaluated by the optimization function in a way which optimizes the resources targeting at a global optimum of all applications.

Application’s workload: If running similar applications, the workload can be used to schedule resources. This is due to the workload also used e.g. in load balancing. Our target function then redistributed R compute resources to each application A_i with $⌈ ⌉ ⃗D := ∑R-⋅ ⃗Wi - α , α ∈ {0,1}. i j ⃗Wj i i$ Hence, it assigns _i resources to the application _i. To avoid over-subscription (see Eq. (8.1)), α has to be chosen in a particular way.
With the assigned number of resources _i, the problem can be reformulated to a scalability optimization. Here, we assume that each application has a perfect strong scalability S(c) for c cores within the range [1;_i] and no performance gain beyond _i cores:
$S (c) := min(c, ⃗Di).$ We can then assign the cores to the applications by adding a core until this leads to no further gain in performance. Obviously, this approximation with a scalability graph (explained in the next paragraph) is only an approximation of the real scalability graph. Using real scalability graphs provided by application are discussed next as an alternative.
Application’s scalability graph: Here, we assume that each application messages its over-time changing strong scalability graph to the RM. There is a linear dependency of the applications scalability and the workload throughput.
For a given number of cores c, the throughput T(c) represents the fraction of time to compute a solution for a fixed problem size w = _i. Next, we consider the throughput improvement for a changing number of cores with the baseline given at the throughput of a single core: = =: S(c). This yields the relation to the scalability graph of strong scalability S(c). Hence, we can use the scalability graph as an optimization hint for the application’s throughput. Furthermore, we can use a scalability graph to optimize for the theoretical global maximum throughput even among heterogeneous applications due to the normalization S(1) = 1 for a single core.
Let the scalability graph S_i(c) be provided by application _i. We further require a strictly monotonously increasing behavior
$Si(c) - Si(c - 1) > 0 ,$ as well as a concavity $Si(c+ 1)- Si(c) ≤ Si(c) - Si(c - 1).$ This also assumes that no super-linear speedups are possible.
The global throughput is then maximized by searching the most efficient resource-to-application combination in . This yields a multi-variate maximization problem in _j for our optimization target

$( ) ∑ ⃗ ma⃗Dx Si(Dj) . i$ (8.4)

Again, we avoid over-subscription with the side constraint ∑ _j_j ≤ R.

Figure 8.2: Two scalability graphs are given. The one for application A has an increasing number of cores from left to right and the one for application B has an increasing number of cores from right to left. The maximum throughput searches for the resource allocation to both application which maximizes the overall throughput [SRNB13a].

Fig. 8.2 shows an example of such an optimization. Two applications are considered which provide a scalability graph. The graph with the red solid line represents the scalability of the first application with increasing number of resources from left to right and the green dashed line represents the scalability graph of the second application with the numbers of resources increasing from right to left.
To optimize the resource distribution for our “maximizing throughput” optimization target, the optimal point is given by the maximum of the sum of both throughput graphs.
However, we can use our requested properties of strictly monotonously increasing and concave scalability graphs. This allows us to solve the maximization problem for an arbitrary number of applications with an iterative gradient method which is related to the steepest descent solver [FP63] and assuming that there is only one optimum, the global optimum.
Initialization: The iteration vector ^(k) for the k-th iteration is introduced which assigns _i computing cores to each application i. We start with ⁽⁰⁾ := (1,1,…,1) which assigns each application a single core at the start. This is required since each application demands for at least a single core on which it is executed.
Iteration: We then compute the throughput improvement for each application i, for a single core additionally assigned to it:

$ΔSi := Si(⃗Bi + 1)- Si(⃗Bi).$ (8.5)

Then, the application n, which yields the maximum throughput improvement is given by ΔS_n := max_j{ΔS_j}.
We can update the resource distribution for the k + 1-th iteration by

$B⃗(k+1) := ⃗B(k) + δi,n$ (8.6)

with the Kronecker symbol δ.
Stopping criterion: If all resources are distributed (∑ _i_i^(k) = R), we can stop our optimization and the last iteration vector contains the optimized target resource distribution for ^(k+1).

[next] [prev] [prev-tail] [front] [up]