We present an efficient approach for writing architecture-agnostic parallel high-performance stencil computations in Julia. Powerful metaprogramming, costless abstractions and multiple dispatch enable writing a single code that serves both productive prototyping on a single CPU and production runs on GPU or CPU workstations or supercomputers. Performance similar to CUDA C is achievable, which is typically a large improvement over the performance reachable with CUDA.jl Array programming.
Our approach to expressing architecture-agnostic high-performance stencil computations relies on Julia's powerful metaprogramming capabilities, costless high-level abstractions and multiple dispatch. We have instantiated the approach in the Julia package ParallelStencil. A simple call to the macro @parallel is enough to parallelize and launch a kernel that contains stencil computations, which can be expressed explicitly or in math-close notation. The package used underneath for parallelization is defined in an initialization call beforehand. Currently supported are CUDA.jl for running on GPU and Base.Threads for CPU. Leveraging metaprogramming, ParallelStencil automatically generates high-performance code suitable for the target hardware and automatically derives kernel launch parameters from the kernel arguments by analyzing the extents of the contained arrays. A set of architecture-agnostic low-level kernel language constructs allows for explicit low-level kernel programming when useful, e.g., for the explicit control of shared memory on the GPU (these low-level constructs are GPU-computing-biased).
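The idea of deriving the launch range from the kernel arguments can be illustrated with a minimal sketch in plain Julia. It deliberately does not use ParallelStencil: the helper `launch_range` and the kernel `diffusion_step!` are hypothetical names introduced here, and the threaded loop only stands in for what a CPU backend based on Base.Threads might do conceptually.

```julia
# Hypothetical helper (not part of ParallelStencil): derive the iteration
# range of a kernel from the sizes of its array arguments. Scalars are ignored.
launch_range(args...) = maximum(length(a) for a in args if a isa AbstractArray)

# One explicit 1-D heat diffusion step on the inner points of T.
# The loop bounds are not passed in; they follow from the array arguments.
function diffusion_step!(T2, T, lam, dt, dx)
    n = launch_range(T2, T, lam, dt, dx)
    Threads.@threads for ix in 2:n-1
        T2[ix] = T[ix] + dt * lam * (T[ix+1] - 2 * T[ix] + T[ix-1]) / dx^2
    end
    return nothing
end

# Usage: a single hot cell spreads to its two neighbours.
T  = zeros(8); T[4] = 1.0
T2 = copy(T)
diffusion_step!(T2, T, 1.0, 0.1, 1.0)
```

In this sketch the caller never states how many points to update, which mirrors the claim that launch parameters can be inferred rather than written by hand.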
Arrays are automatically allocated on the hardware chosen for the computations (GPU or CPU) when using the allocation macros provided by ParallelStencil, which avoids any need for code duplication. Moreover, the allocation macros are fully declarative in order to let ParallelStencil choose the best data layout in memory. Notably, logical arrays of structs (or of small arrays) can be laid out in memory either as arrays of structs or as structs of arrays, since each of these layouts has use cases where it performs best.
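The two layouts can be contrasted with a small sketch in plain Julia. The type `Vel` and the accessor `getx` are hypothetical names chosen for illustration, and the sketch says nothing about how ParallelStencil's allocation macros are implemented; it only shows that the same logical access can be served by either layout.

```julia
# Logical array of small structs: each cell holds a 2-D velocity.
struct Vel
    x::Float64
    y::Float64
end

n = 4

# Array of structs: the fields of one cell are adjacent in memory.
aos = [Vel(i, 10i) for i in 1:n]

# Struct of arrays: each field is stored in its own contiguous array.
soa = (x = [v.x for v in aos], y = [v.y for v in aos])

# The same logical access (cell i, field x) for both layouts,
# selected by multiple dispatch on the container type.
getx(a::Vector{Vel}, i) = a[i].x
getx(a::NamedTuple, i)  = a.x[i]

sum_x_aos = sum(getx(aos, i) for i in 1:n)
sum_x_soa = sum(getx(soa, i) for i in 1:n)
```

Because the access goes through `getx`, code that uses it does not need to change when the layout does, which is the property a declarative allocation interface relies on.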
ParallelStencil is seamlessly interoperable with packages for distributed parallelization, e.g., MPI.jl, in order to enable high-performance stencil computations on GPU or CPU supercomputers. Communication can be hidden behind computation with a simple macro call. Using this feature solely requires that communication can be triggered explicitly, as is possible with, e.g.,
We demonstrate the wide applicability of our approach by reporting on several multi-GPU solvers for geosciences, e.g., 3-D solvers for poro-visco-elastic two-phase flow and for reactive porosity waves. For reference, the latter solvers were ported from MPI+CUDA C to Julia using ImplicitGlobalGrid and achieve 90% and 98% of the performance of the original solvers, respectively, and a nearly ideal parallel efficiency on thousands of NVIDIA Tesla P100 GPUs at the Swiss National Supercomputing Centre. Moreover, we have shown in recent contributions that the approach is in no way limited to geosciences: we have showcased a computational cognitive neuroscience application modelling visual target selection using MPI.jl and a quantum fluid dynamics solver using the Nonlinear Gross-Pitaevskii Equation implemented with
Co-authors: Ludovic Räss¹ ²
¹ ETH Zurich | ² Swiss Federal Institute for Forest, Snow and Landscape Research (WSL)