BLAS and LAPACK are integral components of many numerical algorithms. Because of their importance, a great deal of effort has gone into optimized assembly, C, and Fortran implementations. Nonetheless, early work demonstrated that Julia implementations were often faster than their competitors, while laying the groundwork for new routines specialized to new problems. We discuss a roadmap toward providing Julia BLAS and LAPACK libraries, from optimizations in LoopVectorization to libraries like Octavian and RecursiveFactorization.
The primary motivations for implementing BLAS/LAPACK in Julia are:
Some of the challenges faced in the ecosystem include:
Traditionally, BLAS and LAPACK libraries define many compute kernels, typically written in assembly separately for each supported architecture. Supporting code then builds the full set of BLAS and LAPACK routines by applying these kernels.
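As an illustrative sketch of this kernel-plus-driver structure (the function names here are hypothetical, not from any particular BLAS; a real library would supply many architecture-specific assembly kernels where this uses one generic register-tiled kernel):

```julia
# Hypothetical illustration of the kernel + driver structure of a BLAS library.
# A generic 2x2 register-tiled micro-kernel stands in for the many
# per-architecture assembly kernels a real library would provide.
function microkernel_2x2!(C, A, B, i, j, K)
    # Accumulate a 2x2 tile of C += A*B in "registers" (local variables).
    c11 = c12 = c21 = c22 = zero(eltype(C))
    @inbounds for k in 1:K
        a1 = A[i, k];    a2 = A[i+1, k]
        b1 = B[k, j];    b2 = B[k, j+1]
        c11 += a1 * b1;  c12 += a1 * b2
        c21 += a2 * b1;  c22 += a2 * b2
    end
    @inbounds begin
        C[i, j]   += c11;  C[i, j+1]   += c12
        C[i+1, j] += c21;  C[i+1, j+1] += c22
    end
    return C
end

# Driver: builds the full routine by tiling the iteration space with the kernel.
# (Assumes the dimensions of C are even, to keep the sketch short.)
function matmul!(C, A, B)
    M, K = size(A)
    N = size(B, 2)
    for j in 1:2:N, i in 1:2:M
        microkernel_2x2!(C, A, B, i, j, K)
    end
    return C
end
```

Here the driver only walks the iteration space; all the floating-point work, and hence all the performance tuning, lives in the kernel.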
Libraries such as Octavian and RecursiveFactorization followed this approach, using LoopVectorization to produce most of the kernels. An alternative approach is to use these problems to motivate and guide extending the analysis and optimizations LoopVectorization can perform. As of this proposal, the planned features that would extend the amount of work LoopVectorization can handle automatically, thereby reducing the effort needed to implement new functions, include:
Triangular (non-rectangular) loop bounds, e.g. for m in 1:M, n in 1:m; ...; end.
Multiple loops at the same level of a loop nest, e.g. for m in 1:M; for n in 1:N; end; for k in 1:K; end; end.
Loop-carried dependencies, e.g. for m in 1:M; a[m] += a[m-1]; end.
Together, these would allow LoopVectorization to support loop nests that perform Cholesky factorizations or triangular solves. These should be tunable to achieve (nearly) optimal performance at small to moderate sizes, for use in blockwise routines.
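For concreteness, a forward-substitution (lower-triangular solve) loop nest combines two of the features above: a triangular iteration space and a loop-carried dependency. The plain-Julia sketch below shows the kind of nest these extensions would let LoopVectorization optimize automatically (no LoopVectorization annotations are used here):

```julia
# Forward substitution: solve L * x = b for x, where L is lower triangular.
# The inner loop runs over a triangular range (1:i-1), and each x[i] depends
# on all previously computed x[1:i-1] -- a loop-carried dependency.
function forward_substitute!(x, L, b)
    n = length(b)
    @inbounds for i in 1:n
        acc = b[i]
        for j in 1:i-1             # triangular loop bounds
            acc -= L[i, j] * x[j]  # depends on results of earlier iterations
        end
        x[i] = acc / L[i, i]
    end
    return x
end
```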
An orthogonal set of optimizations would be to develop a model for automatically generating blocking code (working on pieces of arrays at a time that fit nicely into the upper cache levels) and packing code (copying pieces of arrays into temporary buffers, to avoid the pessimized address calculations that result when memory accesses are spread across too many pages; packing also benefits hardware prefetchers, which require relatively small strides between subsequent memory accesses to trigger).
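A minimal hand-written sketch of blocking and packing for matrix multiplication (the block size and function names are illustrative assumptions, not taken from Octavian; a generated version would choose block sizes from a cache model rather than a fixed default):

```julia
# Cache-blocked matmul with packing: work on blocks sized to fit in cache,
# copying each block of A and B into a contiguous buffer first so the inner
# kernel's memory accesses stay within a small, contiguous region.
function blocked_matmul!(C, A, B; blk = 64)
    M, K = size(A)
    N = size(B, 2)
    Abuf = Matrix{eltype(A)}(undef, blk, blk)  # packing buffers, reused
    Bbuf = Matrix{eltype(B)}(undef, blk, blk)  # across all blocks
    for jj in 1:blk:N, kk in 1:blk:K
        jr = jj:min(jj + blk - 1, N)
        kr = kk:min(kk + blk - 1, K)
        Bblk = view(Bbuf, 1:length(kr), 1:length(jr))
        Bblk .= view(B, kr, jr)                # pack block of B
        for ii in 1:blk:M
            ir = ii:min(ii + blk - 1, M)
            Ablk = view(Abuf, 1:length(ir), 1:length(kr))
            Ablk .= view(A, ir, kr)            # pack block of A
            # Block kernel: C[ir, jr] += Ablk * Bblk
            Cv = view(C, ir, jr)
            @inbounds for j in axes(Bblk, 2), k in axes(Bblk, 1), i in axes(Ablk, 1)
                Cv[i, j] += Ablk[i, k] * Bblk[k, j]
            end
        end
    end
    return C
end
```

The inner kernel then only ever touches the small packed buffers, which is what makes its access pattern friendly to both the cache hierarchy and the hardware prefetchers.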
We outline a path toward building up this ecosystem through a combination of these approaches, including applications we can target -- such as stiff ODE solvers, which benefit from fast LU factorizations and triangular solves -- seeing immediate benefits along the way.