These HTML pages are based on the
PhD thesis "Cluster-Based Parallelization of Simulations on Dynamically Adaptive Grids and Dynamic Resource Management" by Martin Schreiber.
There is also
more information and a PDF version available.
3.4 Related work
Given the wide range of existing research on dynamic grid creation and management, we give a
brief overview of the related work. We proceed top-down, starting with pure
mesh-generation and decomposition approaches, followed by frameworks for running simulations on
dynamically changing meshes, and finally related work that investigated the Sierpiński SFC for running
simulations.
3.4.1 Dynamic mesh partitioning
We first introduce different approaches for dynamic mesh partitioning as well as dynamic mesh
rebalancing, assuming the mesh to be already given. We do not aim for a complete literature survey,
but only give an overview of the most closely related work.
Graph partitioner
One way to deal with mesh generation and its partitioning is to consider the mesh as a graph
connecting mesh elements via their hyperfaces and apply graph partitioning algorithms on this graph.
Such HPC partitioning algorithms aim at creating mesh partitions with properties
that enable efficient parallelization. One optimization objective is, e.g., to reduce the number of
interfaces shared with other partitions, and thereby the data to be communicated, by reducing
the number of edge cuts for two-dimensional meshes and face cuts for three-dimensional
meshes.
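To make the edge-cut objective concrete, the following minimal sketch counts cut edges for a toy mesh graph; the mesh, partition assignment, and function name are illustrative, not taken from any of the cited tools.

```python
# Count the edge cut of a partitioned mesh graph.
# The toy 2x2 quad mesh and partition below are illustrative only.

def edge_cut(edges, part):
    """Number of edges whose endpoints lie in different partitions."""
    return sum(1 for u, v in edges if part[u] != part[v])

# Cells 0..3 of a 2x2 mesh; shared faces become graph edges.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]

# Top/bottom split: cells {0, 1} vs. {2, 3}.
partition = {0: 0, 1: 0, 2: 1, 3: 1}

print(edge_cut(edges, partition))  # 2 cut faces -> data to be communicated
```

A graph partitioner searches over such assignments for one that keeps this count small while balancing the number of cells per partition.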
For dynamically adaptive meshes, additional optimization objectives for optimized remeshing are given
by minimizing communication data, considering the costs for meshes after load balancing [KSK03] and
maximizing the quality of load balancing. Examples of such graph-based meshing and partitioning tools
are ParMETIS [KK98]
and PT-Scotch [CP08].
Geometric partitioner using SFCs
Instead of using graph theory for optimized mesh partitioning, the underlying geometry and spatial
placement of the mesh elements can be used. Such geometric partitioners can be based, e.g., on SFCs.
SFCs provide one property which is of particular importance for partitioning: the surjective mapping
of each point in a d-dimensional space to a one-dimensional interval, preserving the locality of the
d-dimensional space in the one-dimensional representation [MJFS01]. Thus, load balancing can be
reduced to a one-dimensional problem: each partition is assigned an interval of the one-dimensional
representation with similar workload [BZ00]. Libraries supporting SFC-based partitioning are e.g.
amatos [BRH+05] and
Zoltan [BCCD12].
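A minimal sketch of such an SFC-based geometric partitioner is given below; for brevity it uses a Morton (Z-order) curve instead of the Sierpiński curve, and all function names are illustrative rather than taken from amatos or Zoltan.

```python
# Geometric partitioning along an SFC: map cells to 1D, sort, cut into intervals.
# A Morton (Z-order) curve is used for brevity; the principle is the same for
# the Sierpinski curve. All names are illustrative.

def morton2d(x, y, bits=16):
    """Interleave the bits of integer coordinates (x, y) into a 1D index."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x bits at even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bits at odd positions
    return code

def sfc_partition(cells, num_parts):
    """Sort cells along the curve, then cut the 1D order into equal intervals."""
    order = sorted(cells, key=lambda c: morton2d(*c))
    n = len(order)
    return [order[p * n // num_parts:(p + 1) * n // num_parts]
            for p in range(num_parts)]

cells = [(x, y) for x in range(4) for y in range(4)]  # 4x4 grid of cells
parts = sfc_partition(cells, 4)
print([len(p) for p in parts])  # [4, 4, 4, 4] -> balanced workload
print(parts[0])                 # a spatially contiguous 2x2 block
```

Because the SFC preserves locality, each one-dimensional interval corresponds to a spatially compact region, which keeps partition boundaries, and hence the communication volume, small.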
Graph- vs. SFC-based partitioning
Dynamic partitioning based on graph partitioners involves complex graph
operations [BZ00] and results in an NP-hard problem [Zum00]. Even when multilevel diffusive
scheduling is used to mitigate the NP-hardness, SFC-based partitioning still leads to faster load
balancing while yielding edge cuts close to those of graph-based partitioning
algorithms [Mit07].
3.4.2 Simulation software including grid generation
We give an overview of well-known frameworks with built-in grid generation. Again, we do not
aim for a complete survey, but only consider the work of most relevance for our target
simulations. We compare the capabilities of each framework against our requirements described in
the previous sections.
- OpenFOAM: Features of OpenFOAM include a PDE toolbox implemented with an
embedded language, capabilities for simulations of CFD with different models, and
also dynamic mesh generation. For adaptive mesh refinement, which is required for
efficient simulation of wave propagation, OpenFOAM requires additional memory to store,
e.g., connectivity information on cells.
- Dune: The Dune Project
(Distributed and Unified Numerics Environment) also offers dynamic adaptive grids with
the ALUGrid module with ParMETIS [BBD+08]. Dune stores connectivity information
on each cell and therefore also requires additional memory similar to OpenFOAM.
- p4est: The p4est [BWG11] software library offers mesh management for parallel dynamic
AMR on forests of octrees and is only partly a simulation framework by itself. It
has already proved its high scalability [BGG+08], also in the context of DG simulations. Since
it is based on octrees, hanging nodes arise. As we require triangular
grids, triangles could be reconstructed by inserting edges into each cell; however, this
would require rearranging mesh triangles whenever hanging nodes are
created or removed in a cell.
- Peano: This framework is a mesh-based PDE solver on Cartesian grids with the
mesh generated by the Peano SFC. Outstanding features are e.g. support for arbitrary
dimensions and dynamically adaptive mesh refinement with a recursive trisection in each
dimension [Nec09,Wei09]. Since it is based on Cartesian grids, the drawbacks are similar
to those of p4est with our requirements of triangle cells.
With memory efficiency for dynamically adaptive simulations on triangular grids
being the main driving requirement of this work, none of the frameworks mentioned above, and to
the best of our knowledge no other framework, fulfills our demands.
3.4.3 Related software development
This section outlines our contributions compared to the preceding work on serial and
parallel simulation with the Sierpiński stack- and stream-based approach. Compared to the
dissertation of Csaba Vigh [Vig12], our major algorithmic differences and achievements are the
following:
- Our adaptivity automaton and the cluster-based parallelization allow skipping
adaptivity traversals for clusters that are already consistent.
- The parallelization in [Vig12] focuses only on MPI, whereas our run-length encoded
communication scheme and the way the meta information is updated provide an
efficient parallelization for shared- and distributed-memory systems.
- We introduced an efficient node-based communication scheme for the parallelization.
- The Riemann solvers in our work are more sophisticated and require an improved
communication scheme.
- In [Vig12], data migration is presented with an enumeration of cells following the SFC.
These numbers are propagated via edges after refinement and coarsening operations. In
contrast, our approach does not require such global information, resulting, e.g., in a
considerable improvement for data migration.
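The run-length encoded communication meta information mentioned above can be sketched as follows; this is a generic RLE over per-boundary-edge neighbor ranks in SFC order, not the thesis's exact data layout, and all names are illustrative.

```python
# Run-length encode the communication meta information of one cluster:
# the rank owning the neighbor on the other side of each boundary edge,
# listed in SFC order, is compressed into (rank, count) runs.
# This is a generic sketch, not the exact layout used in the thesis.

def rle_encode(neighbor_ranks):
    """Compress a list of neighbor ranks into (rank, run_length) pairs."""
    runs = []
    for r in neighbor_ranks:
        if runs and runs[-1][0] == r:
            runs[-1][1] += 1
        else:
            runs.append([r, 1])
    return [tuple(run) for run in runs]

# Six boundary edges: three shared with rank 1, two with rank 2, one with rank 3.
boundary = [1, 1, 1, 2, 2, 3]
print(rle_encode(boundary))  # [(1, 3), (2, 2), (3, 1)]
```

Since the SFC visits the edges shared with a neighboring cluster contiguously, such runs stay short, and refinement or coarsening only changes run lengths locally, which is the basis of the implicit update of the meta information described above.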
3.4.4 Impact on and from related work
The serial traversal with the stack- and stream-based approach used in this thesis is based on the
preceding research on serial traversals, e.g. [GMPZ06,BSVB08]. In our work, we developed a
software design and algorithms for an efficient parallelization on shared- as well as
distributed-memory systems.
There are several groups working on SFC-based traversal and organization schemes for PDE
solvers; to our knowledge, only the work of Prof. Michael Bader and his co-workers, called
sam(oa)2, is also Sierpiński-based. Their work has the closest relation to this thesis, which is why
we mention it here. Their focus was initially on a software design and a parallelization approach
that only supports MPI parallelization, see e.g. [Vig12], which was developed in cooperation with
Kaveh Rahnema.
[ERRATA START]
Researchers are typically obliged to cite the ideas and related work of others.
However, it is sometimes also necessary to publish information on repeated redevelopments and on ideas being taken from others.
This is the reason why I'm now publishing this information as part of an errata.
The idea of a cluster-based parallelization concept (e.g. independent storage for partitions, a run-length encoded communication scheme which is updated implicitly by adaptivity markers, etc.) was developed at the end of 2010, and the first parallel (shared-memory) version was finished in early 2011, see also
https://bitbucket.org/schreiberx/sierpi/.
This was explained to Prof. Michael Bader at the beginning of 2011.
In particular, based on my development, I provided Prof. Michael Bader with plenty of new ideas and concepts on algorithmic approaches to simulations with dynamic adaptive mesh refinement in 2010 and 2011.
These ideas were orthogonal to their development, which was described in a preliminary version of the PhD thesis of Csaba Vigh that Csaba gave to me at the end of 2010 or the beginning of 2011.
Most of the underlying ideas and concepts were explained to Prof. Michael Bader in 2011 and in particular to Oliver Meister in 2011/2012, see e.g. the
presentation at the chair meeting. This also includes, in particular, the suggested extensions of the clustering approach to SFC cuts (an extension from tree splits to intervals of the SFC).
The limitations of their original approach (e.g. MPI-only, no cluster-based optimizations, no optimizations of residual-aware iterative solvers via cluster skipping, etc.) and how the cluster-related algorithms work were explained to Oliver Meister when trying to convince him that we should join forces.
The task of merging the original parallelization approach (developed by Kaveh Rahnema and Csaba Vigh) into sam(oa)2 was assigned to Oliver Meister in mid 2012.
We will briefly outline how Oliver Meister, one of the main sam(oa)2 developers, decided instead to drop this previous approach in mid 2012.
This account is based on memory and notes; hence the statements are not verbatim.
- Oliver Meister (beg. 2011): Among other things, during a visit in Stuttgart around the end of 2010/beginning of 2011, I explained to Oliver Meister my software concepts of using C++ and a code generator.
He reprogrammed this as part of a student project: Miriam Greis, Jessica Hacklaender, Pascal Hirmer, "Systemanalyse und Workflowverwaltung eines Frameworks fuer Cache-effiziente adaptive Simulation" (System Analysis and Workflow Management of a Framework for Cache-Efficient Adaptive Simulation), http://elib.uni-stuttgart.de/opus/volltexte/2012/7375/ .
Oliver Meister never showed any intention of collaborating on or sharing this work.
(Google made me aware of this work in 2013.)
- Oliver Meister (mid 2012): ``Michael (Prof. Michael Bader) wants me to merge my development with the parallelization approach of Kaveh, however this doesn't make any sense.
I'm not doing so, but I'm now also implementing something which supports cluster-based local time stepping".
Oliver Meister thus decided once more to follow the ideas of others and to drop the previous parallelization approach into the trash bin.
Here, Oliver Meister also broke the agreement made between myself and Kaveh Rahnema (Oliver Meister attended this meeting), who was originally responsible for the parallelization of the sam(oa)2 framework.
This agreement was to stick to the old parallelization approach and not to reprogram the clustering approach.
- Prof. Michael Bader (mid 2012): Presentation at the Isaac Newton Institute,
"Parallelization and Software Concepts for Tsunami Simulation on Dynamically Adaptive Triangular Grids", http://rss.sms.cam.ac.uk/media/1292670,
http://www.newton.ac.uk/programmes/AMM/seminars/2012082414301.pdf,
with performance results indicating that their work is based on
the PhD thesis of Csaba Vigh (in fact, this was developed together with Kaveh Rahnema).
I should mention here that at this point in time Prof. Michael Bader was probably not aware of Oliver Meister's decision not to follow his order but to "also implement[...] something which supports cluster-based local time stepping".
(See e.g. also the video recording of this event.)
- Oliver Meister (end 2012): ``We don't call them clusters, we call them segments".
- Oliver Meister (end 2012): ``I'm now also first implementing OpenMP. Then it's easy to extend it to MPI". Here I would like to mention that a clustering approach, as it was first presented to them, results in an OpenMP implementation (multiple partitions per program context) which can then be easily extended to MPI, as was explained to him.
- Kaveh Rahnema (end 2012): ``Is it OK that we're using your run-length encoding?" My answer: ``Yes, of course". Here I should mention that with their original approach from 2010, using the RLE allows neither OpenMP parallelization, nor optimal data migration, nor cluster-based optimizations such as work stealing on shared- and distributed-memory systems.
- Oliver Meister (beg. 2013): After offering him co-authorship on a joint paper on the cluster-based parallelization concept: ``Michael [Bader, Prof.] first wants me to finish the parallelization before I can write on a paper"... and I waited...
- Oliver Meister (mid 2013): Presented the new parallelization in sam(oa)2, including OpenMP, as an extension of the original parallelization concept.
- Oliver Meister (mid 2013): Claimed ``These are all my ideas".
It should be stated here that an algorithm to reconstruct the RLE-based connectivity information with a spanning tree was in fact developed by Oliver Meister himself.
- Oliver Meister (mid 2013): After asking him about planned publications: ``Yes, we [not including the originator of the ideas] are going to publish it".
- Oliver Meister (mid 2013): After asking him a follow-up question about how they can publish this work while my related work (e.g. the data migration, 0-length RLE for nodes) was not yet published:
``We're simply not citing it".
- Prof. Michael Bader (mid 2013):
A roughly three-hour conversation with Prof. Michael Bader, without mentioning the previous statement of Oliver Meister, since this would probably have cost Oliver Meister his PhD position.
The work on cluster-based optimizations had already been accepted at EuroPar 2013, and there was no reason to discuss this.
No concrete agreement was made.
- As part of the GeoScience conference 2013, they were still honest in the abstract:
"... discuss a run-length encoding to deal with the data exchange on the shared inter-process ...",
showing that they reprogrammed the run-length encoding; but this was the last time that this was stated.
- Oliver Meister (end 2013): He asked whether I had published the data migration, since they also wanted to publish something.
I decided to write the final paper in 2014 in a way which also covers the extension to SFC cuts, as it was explained to Oliver Meister before he started ``implementing something which also supports cluster-based local time stepping".
- Oliver Meister (beg. 2014): Again, he asked whether I had published the data migration.
A list of related work publishing the ideas which Oliver Meister reprogrammed was given to him in 2014 (RLE edge communication: HiPC; RLE node communication: ParCo; cluster-based optimization: EuroPar; data migration: ICCS).
There is no excuse for claiming that he was not aware of a way to properly cite the related work.
- (2015-...) See the publications and presentations of Prof. Michael Bader and Oliver Meister, which simply claim that they can do MPI+OpenMP parallelization without providing details on the underlying algorithms.
Apparently it was not seen as the duty of researchers to cite the sources of what they (Oliver Meister, Prof. Michael Bader and Kaveh Rahnema) were successively reprogramming, and they even intended to rename it: "We don't call them clusters, we call them segments" (statement of Oliver Meister, 2013).
As a matter of honesty, it is the task of researchers to cite the related work which they reprogrammed and to offer co-authorship to the person who significantly contributed by providing ideas, concepts and strategies.
It is also important that honest researchers are open about their decision to drop their former approach (otherwise Csaba Vigh would still be on the author list of their source code) and to reprogram the approach of others.
It seems to be appropriate to publish this and further comments as part of an errata of my PhD thesis.
To make a long story short, Oliver Meister steadily reprogrammed implementations and ideas.
There was never any will to cooperate on Oliver Meister's side, and he reprogrammed ideas and concepts from the very beginning.
Only the concept which I personally developed as part of my PhD allows efficient MPI/OpenMP/TBB parallelization, cluster-based optimizations, node-based communication, efficient updating of RLE data, etc.
We will see whether Prof. Michael Bader, Oliver Meister and Kaveh Rahnema cite the sources correctly (citations can also be given by referencing presentations, communications, etc.)...
Finally, to bridge the text correctly to the next paragraph: the cluster-based parallelization concept, and many other algorithms and potential features which arise with this novel parallelization concept, were explained to Prof. Michael Bader in 2011.
[ERRATA END]
Several discussions with him and his group led to certain assimilations of our approaches.
However, the major difference still lies in their focus on SFC cuts for partitioning. These SFC
cuts lead to further research challenges, such as the construction of consistent meta
information.