BLAS & LAPACK are an integral component of many numerical algorithms. Due to their importance, a lot of effort has gone into optimizing ASM/C/Fortran implementations. Nonetheless, early work demonstrated Julia implementations were often faster than competitors, while laying groundwork for new routines specialized for new problems. We discuss a roadmap toward providing Julia BLAS and LAPACK libraries, from optimizations in LoopVectorization to libraries like Octavian and RecursiveFactorization.

The primary motivations for implementing BLAS/LAPACK in Julia are:

- Because we can!
- The existence of highly optimized alternatives such as MKL provide a solid benchmark by which to assess how we're doing before applying the same optimization approaches to novel problems.
- We can adapt the routines easily to related operation types such as evaluating dense layers or miscellaneous tensor operations.
- Generic with respect to number types, whether that means mixing precision or something as exotic as Tropical Numbers (showcased in TropicalGEMM.jl).
- Ability to take advantage of compile time information and specialize, e.g. for statically sized arrays.
- Relatively painless support for new hardware, as we do not need to write assembly kernels. Feature detection and support for generating optimized code will also be tied with LLVM rather than libraries like OpenBLAS, which tend to lag far behind the compilers.

Some of the challenges faced in the ecosystem include:

- Efficient composable threading with low enough overhead to beat MKL for small array sizes.
- Compilation time or sysimage building to avoid "time to first matmul" problems.
- The implementations of BLAS and LAPACK routines themselves.

Traditionally, BLAS and LAPACK libraries define many compute kernels, typically written in assembly for each supported architecture. Supporting code then builds the supported BLAS and LAPACK routines through applying these kernels.

Libraries such as Octavian and RecursiveFactorization followed this approach while using LoopVectorization to produce most of the kernels. An alternative approach is to use these problems to motivate and guide extending LoopVectorization to perform analysis and optimizations. At the time I submit this proposal, planned features that will be able to extend the amount of work LoopVectorization can handle automatically, thereby reducing the effort needed to implement new functions:

- Allowing the bounds of inner loops to depend on the induction variables of outer loops. For example, loops of the form
`for m in 1:M, n in 1:m; ...; end`

. - Allow multiple loops to occur at the same level in the nest. For example, loops of the form
`for m in 1:M; for n in 1:N; end; for k in 1:K; end; end`

. - Model dependencies across loop iterations, and avoid violating them. For example, loops of the form
`for m in 1:M; a[m] += a[m-1]; end`

.

Together, these would allow LoopVectorization to support loop nests performing cholesky factorizations or triangular solves. These should be tunable to perform with (nearly) optimal performance at small to moderate sizes for use in blockwise routines.

An orthogonal set of optimizations would be to develop a model for automatically generating blocking (working on pieces of arrays at a time that fit nicely into upper cache levels) and packing code (copying pieces of arrays into temporary buffers to avoid pessimized address calculations due to memory accesses being spread across too many pages; this also benefits hardware prefetchers, which require relatively small strides between subsequent memory accesses to trigger).

We outline a path toward building up an ecosystem through a combination of the approaches, including applications we can target -- such as stiff ODE solves benefiting from LU and triangular solves -- and see immediately benefits along the way.