In this talk, we will be discussing some of the state of the art techniques to scale training of ML models beyond a single GPU, why they work and how to scale your own ML pipelines. We will be demonstrating how we have scaled up training of Flux models both by means of data parallelism and by model parallelism. We will be showcasing ResNetImageNet.jl and DaggerFlux.jl to accelerate training of deep learning and scientific ML models such as PINNs and the scaling it achieves.
With the scale of the datasets and the size of the models growing rapidly, one cannot reasonably train these models on a single GPU. It is no secret that that training big ML models - be they large language models, image recognition tasks, large PINNs etc - requires an immense amount of hardware, and engineering knowledge.
So far, our tools in FluxML have been limited to training on a single GPU, and there is a pressing need for tooling that can scale up training beyond a single GPU. This is important not just for current Deep Learning models but also to scale training of scientific machine learning models as we see more sophisticated neural surrogates emerge for simulations and modelling. To fulfil this need, we have developed some tools that can reliably and generically scale training of differentiable pipelines beyond a single machine or GPU device. We will be showcasing ResNetImageNet.jl and DaggerFlux.jl which uses Dagger.jl to accelerate training of various model types and the scaling it achieves.