In high performance computing (HPC) applications, scientific or engineering problems are
solved in a highly parallel and often necessarily distributed manner. The distribution of work
leads to the distribution of data and thus also to communication between the participants of
the computation. The application programmer has many different communication libraries and
application programming interfaces (APIs) to choose from, one of the most recent libraries being
the Global Address Space Programming Interface (GASPI). This library takes advantage of the
hardware and especially interconnect developments of the past decade, enabling true remote
direct memory access (RDMA) between nodes of a cluster.
The one-sided, asynchronous semantics of GASPI routines open up multiple research questions
with respect to the implementation of collective communication routines, i.e., routines in which
a group of processes participates in the communication. The GASPI specification itself only
offers two of these collective operations: the allreduce, computing a global result from the data
of all participants, and the barrier, constituting a synchronization point for all members of the
group. For these collective routines, appropriate underlying algorithms have to be chosen. Given
the one-sided, asynchronous and split-phase semantics of GASPI collective routines, algorithms
used in other widespread communication libraries such as the Message-Passing Interface (MPI)
may no longer be a good fit. In this thesis, existing algorithms have been
reevaluated for their usability in GASPI collective routines in the context of the newly designed
library GASPI_COLL, which amends the existing GASPI implementation GPI2 with additional
algorithms for the allreduce and with further collective routines: reduce and broadcast.
For the split-phase allreduce, algorithms with a butterfly-like communication scheme have been
tested extensively and found to be well suited due to their low number of communication rounds
and the involvement of all participants in each round. This keeps both the number of repeated
calls to the allreduce routine and the idle time of all nodes very small. One of the most
widespread algorithms for barrier operations, the dissemination algorithm, has been adapted
to be usable for the allreduce operation as well. The adapted n-way algorithm shows very
good results compared to the native implementation of the GPI2 allreduce and different MPI
implementations.
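The round structure of such an n-way dissemination scheme can be illustrated with a small simulation. The following Python sketch is purely illustrative (it is not GASPI_COLL code) and assumes a sum reduction and a rank count that is a power of n+1: in round r, each rank combines the partial results of n peers at distances k(n+1)^r, so every rank is active in every round and the global result is ready after log_{n+1}(p) rounds.

```python
# Illustrative simulation of an n-way dissemination allreduce (sum).
# Assumption: the number of ranks p is a power of n+1, so every input
# value is counted exactly once.

def nway_dissemination_allreduce(values, n):
    p = len(values)
    state = list(values)
    rounds = 0
    while (n + 1) ** rounds < p:
        rounds += 1
    assert (n + 1) ** rounds == p, "sketch assumes p is a power of n+1"
    for r in range(rounds):
        dist = (n + 1) ** r
        # Every rank receives from n peers per round, so all ranks are
        # busy in every round and no rank idles between rounds.
        new_state = []
        for i in range(p):
            acc = state[i]
            for k in range(1, n + 1):
                acc += state[(i - k * dist) % p]
            new_state.append(acc)
        state = new_state
    return state, rounds

values = list(range(9))          # one input value per rank, p = 9
result, rounds = nway_dissemination_allreduce(values, n=2)
# every rank ends up with the global sum after log_3(9) = 2 rounds
```

With p = 9 ranks and n = 2, every rank holds the global sum 36 after only two rounds, compared to the ceil(log2 9) = 4 rounds of a binary scheme.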
To make the one-sided communication semantics of GASPI manageable for the application
programmer, the GASPI specification introduces weak-synchronization primitives that notify
the destination side of the arrival of data. This notification mechanism removes the need
for global synchronization points and for waiting on multiple communication requests. It had
previously been available only for write-based operations, but has been extended to the read
routine in the scope of this thesis, introducing gaspi_read_notify.
With this new routine, the thesis establishes the basis of a completely one-sided, asynchronous
graph exploration, implemented with the notified read operation. This enables a broader
audience to apply data-analytical methods to big data. Big data poses a real challenge for
graph-analytical methods because the data needs to be distributed across multiple nodes, which
introduces high communication overhead whenever both sides are involved in the communication.
This issue is eliminated through gaspi_read_notify.
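The essence of the notified-read pattern can be sketched with a toy model. The identifiers below are illustrative and do not reproduce the GASPI API; the sketch only shows why the data owner stays entirely passive: the reader pulls a remote buffer, and a local notification is raised once the data has arrived, which the reader can poll while overlapping other work.

```python
import threading
import time

# Toy model of a notified read (identifiers are illustrative, not the
# GASPI API): the reader pulls a remote buffer; a notification is set on
# the reader's side once the data has arrived. The data owner never has
# to post a matching send or take any action at all.

class Segment:
    def __init__(self, data):
        self.data = data            # remotely readable memory
        self.notifications = {}     # local notification slots

def read_notify(src, dst, notification_id):
    def transfer():                 # asynchronous, RDMA-like transfer
        time.sleep(0.01)
        dst.data = list(src.data)
        dst.notifications[notification_id] = True   # arrival notification
    threading.Thread(target=transfer).start()

def notify_test(seg, notification_id):
    return seg.notifications.get(notification_id, False)

owner = Segment([1, 2, 3])          # passive side: calls nothing
reader = Segment([0, 0, 0])
read_notify(owner, reader, notification_id=0)
while not notify_test(reader, 0):
    pass                            # the reader could overlap other work here
```

In a one-sided graph exploration, the exploring rank can fetch remote adjacency data this way without involving the owning rank in any communication call.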
Last but not least, the potential use of gaspi_read_notify for a distributed matrix transpose
was investigated. Not only is the matrix transpose a widespread communication scheme in
HPC applications, it can also be considered a special case of an alltoall communication.
The split-phase, one-sided paradigm of GASPI collective routines has inspired the idea of
a partially evaluable alltoallv; as a first step towards this routine, the applicability of
gaspi_read_notify to the implementation of the alltoall can be deduced from the matrix
transpose. On the available systems, however, this kind of implementation cannot be encouraged.
Yet, the experiments in this thesis have also shown how strongly communication routines
and algorithms depend on the underlying hardware. Extensive tests on different system
architectures will therefore have to be carried out in the future.
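The view of the transpose as a special case of alltoall can be made concrete with a small sketch. The following Python code is a hypothetical illustration (not the GASPI implementation): a square matrix distributed in row blocks is transposed by an alltoall of column sub-blocks followed by purely local transposes.

```python
# Hypothetical sketch (not the GASPI implementation): a row-block-distributed
# matrix transpose expressed as an alltoall of sub-blocks plus local work.
# Assumes a square matrix of order p * rows_per_rank.

def transpose_block(block):
    return [list(col) for col in zip(*block)]

def distributed_transpose(local_rows, p):
    # local_rows[i] is the row block owned by rank i.
    rows_per_rank = len(local_rows[0])
    # Split each rank's block into p column sub-blocks: in the alltoall,
    # sub-block j of rank i is sent to rank j.
    sent = [
        [[row[j * rows_per_rank:(j + 1) * rows_per_rank] for row in block]
         for j in range(p)]
        for block in local_rows
    ]
    # Rank j receives sub-block j from every rank i; transposing each
    # sub-block locally and stacking them side by side yields rank j's
    # row block of the transposed matrix.
    result = []
    for j in range(p):
        recv = [transpose_block(sent[i][j]) for i in range(p)]
        result.append([sum((blk[r] for blk in recv), [])
                       for r in range(rows_per_rank)])
    return result

A = [[4 * r + c for c in range(4)] for r in range(4)]   # 4x4 test matrix
p = 2
local_rows = [A[0:2], A[2:4]]                           # row-block distribution
T = distributed_transpose(local_rows, p)
assert T[0] + T[1] == transpose_block(A)                # equals A transposed
```

The communication step in the middle is exactly an alltoall of equally sized blocks, which is why conclusions drawn from the notified-read transpose carry over to an alltoall implementation.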