We present a straightforward approach for the distributed parallelization of stencil-based Julia applications on a regular staggered grid using GPUs and CPUs. The approach leverages remote direct memory access (RDMA) and has been shown to enable close-to-ideal weak scaling of real-world applications on thousands of GPUs. The communication performed can be easily hidden behind computation.
The approach presented here renders the distributed parallelization of stencil-based GPU and CPU applications on a regular staggered grid almost trivial. We have instantiated the approach in the Julia package
ImplicitGlobalGrid.jl. A highlight in the design of
ImplicitGlobalGrid is the automatic implicit creation of the global computational grid based on the number of processes the application is run with (and on the process topology, which can be either explicitly chosen by the user or defined automatically). As a consequence, the user only needs to write code that solves the problem on one GPU/CPU (the local grid); then, as little as three functions can suffice to transform a single-GPU/CPU application into a massively scaling multi-GPU/CPU application: a first function creates the implicit global staggered grid, a second function performs a halo update on it, and a third function finalizes the global grid.
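The three functions named above can be sketched as follows for a 3-D stencil time loop. The function names (init_global_grid, update_halo!, finalize_global_grid) follow the package documentation; the kernel diffusion_step! is a hypothetical placeholder for the user's single-device stencil code:

```julia
using ImplicitGlobalGrid

# Local grid size per process; the global grid is created implicitly
# from the number of MPI processes and their topology.
nx, ny, nz = 64, 64, 64
me, dims   = init_global_grid(nx, ny, nz)   # (1) create the implicit global grid

T  = zeros(nx, ny, nz)                      # local array, incl. halo cells
T2 = copy(T)

for it = 1:100
    diffusion_step!(T2, T)                  # hypothetical stencil kernel (user code)
    update_halo!(T2)                        # (2) exchange halos with the neighbors
    T, T2 = T2, T
end

finalize_global_grid()                      # (3) tear down the global grid
```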
ImplicitGlobalGrid relies on
MPI.jl to perform halo updates close to hardware limits. For GPU applications,
ImplicitGlobalGrid leverages remote direct memory access when CUDA- or ROCm-aware MPI is available and uses highly optimized asynchronous data transfer routines to move the data through the hosts when CUDA- or ROCm-aware MPI is not present. In addition, pipelining is applied at all stages of the data transfers, improving the effective GPU-to-GPU throughput. Low-level management of memory, CUDA streams, and ROCm queues permits the efficient reuse of send and receive buffers and streams throughout an application without placing the burden of their management on the user. Moreover, all data transfers are performed on non-blocking high-priority streams, allowing communication to be overlapped optimally with computation.
ParallelStencil.jl, for example, can exploit this overlap with a simple macro call.
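In ParallelStencil.jl, this overlap is expressed with the @hide_communication macro: computation of the boundary region and the halo update are scheduled first, and the inner points are then computed concurrently, hiding the communication. The boundary width (8, 8, 8) and the kernel step! below are illustrative choices, not values prescribed by the packages:

```julia
using ParallelStencil, ImplicitGlobalGrid
@init_parallel_stencil(Threads, Float64, 3)

# The tuple gives the boundary width in each dimension; the halo update
# of T2 is overlapped with the computation of the inner points.
@hide_communication (8, 8, 8) begin
    @parallel step!(T2, T)      # illustrative stencil kernel (user code)
    update_halo!(T2)
end
```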
ImplicitGlobalGrid is fully interoperable with
MPI.jl. By default, it creates a Cartesian MPI communicator, which can be easily retrieved together with other MPI variables. Alternatively, an MPI communicator can be passed to
ImplicitGlobalGrid for usage. As a result,
ImplicitGlobalGrid's functionality can be seamlessly extended using MPI.jl.
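A minimal interoperability sketch: assuming, as documented for the package, that init_global_grid returns the rank, the process-topology dimensions, the process count, the Cartesian coordinates, and the Cartesian communicator, plain MPI.jl calls can be mixed freely with ImplicitGlobalGrid, e.g. for a global reduction over the distributed local arrays:

```julia
using ImplicitGlobalGrid, MPI

# Retrieve the Cartesian MPI communicator created by ImplicitGlobalGrid.
me, dims, nprocs, coords, comm_cart = init_global_grid(64, 64, 64)

# Compute a global maximum over all processes with plain MPI.jl.
T        = rand(64, 64, 64)
max_glob = MPI.Allreduce(maximum(T), MPI.MAX, comm_cart)

finalize_global_grid()
```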
The modular design of
ImplicitGlobalGrid, which heavily relies on multiple dispatch, enables adding support for other hardware with little development effort. Support for AMD GPUs using the recently matured
AMDGPU.jl package has already been implemented as a result.
ImplicitGlobalGrid presently supports distributed parallelization for CUDA- and ROCm-capable GPUs as well as for CPUs.
We show that our approach is broadly applicable by reporting scaling results of a 3-D multi-GPU solver for poro-visco-elastic two-phase flow and of various mini-apps that represent common building blocks of geoscience applications. For all these applications, nearly ideal parallel efficiency on thousands of NVIDIA Tesla P100 GPUs at the Swiss National Supercomputing Centre is demonstrated. Moreover, we have shown in a recent contribution that the approach is in no way limited to the geosciences: we have showcased a quantum fluid dynamics solver using the nonlinear Gross-Pitaevskii equation implemented with ImplicitGlobalGrid.
Co-authors: Ludovic Räss¹ ², Ivan Utkin¹
¹ ETH Zurich | ² Swiss Federal Institute for Forest, Snow and Landscape Research (WSL)