OndaBatches.jl: Continuous, repeatable, and distributed batching

07/28/2023, 8:30 PM - 9:00 PM UTC
32-124

Abstract:

At Beacon Biosignals we don't want to reinvent the wheel for data loading and batch randomization every time we stand up a new machine learning project. So we've collected a set of patterns that have proven useful across multiple projects into OndaBatches.jl, which serves as a foundation for building the specific batch randomization, featurization, and data movement systems that each machine learning project requires.

Description:

Our typical machine learning task involves time series datasets composed of at least thousands of multichannel recordings, each of which has on the order of 100 million individual samples, with accompanying dense or sparse labels. While not the largest machine learning datasets known to humankind, these are large enough to be generally inconvenient. The size, shape, and structure of these datasets (and the associated learning tasks) require some modifications of a typical machine learning workflow (e.g. one in which the entire dataset is processed in each training epoch).

In this talk, I will present OndaBatches.jl, a Julia package that implements a set of patterns that have proven to be useful across a number of projects at Beacon. OndaBatches.jl serves as a foundation for building the specific batch randomization, featurization, and data movement systems that each machine learning project requires. Its purpose is to build and serve batches for machine learning workflows based on densely labeled time series data, in a way that is:

  • distributed (cloud native, throw more resources at it to make sure data movement is not the bottleneck)
  • scalable (handle out-of-core datasets, both for signal data and labels)
  • deterministic + reproducible (pseudo-random)
  • resumable
  • flexible and extensible via normal Julia mechanisms of multiple dispatch

This talk focuses on two aspects of OndaBatches.jl design and development. First, I'll describe the process of moving a local workflow into a distributed setting in order to support scalability. Second, I'll discuss how Julia's composability has shaped the design and functionality of OndaBatches.jl. In particular, OndaBatches.jl builds on...

  • ...Onda.jl to represent both the multi-channel time series that is the input data and the regularly-sampled labels.
  • ...Distributed.jl to compose well with various cluster managers (including Kubernetes via K8sClusterManagers.jl) in service of scalability.
  • ...base Julia patterns around iteration in order to separate batch state from batch content (in service of reproducibility and resumability).
  • ...Julia's multiple dispatch pattern to allow our machine learning teams to customize behavior where needed without having to re-invent basic functionality every time.
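
The idea of separating batch state from batch content can be illustrated with a minimal sketch built on Base's `iterate` protocol. This is not the OndaBatches.jl API: `ToyBatcher`, its fields, and the recording names are hypothetical and exist only for illustration. The state is a small (RNG, counter) pair that is never mutated in place, so any saved state can be replayed to reproduce a batch or to resume iteration.

```julia
using Random

# Hypothetical type (not part of OndaBatches.jl): a source of random batches.
struct ToyBatcher
    recordings::Vector{String}  # identifiers of the recordings to sample from
    batch_size::Int
    seed::Int
end

# The batch *state* is (rng, batch number); the batch *content* is derived from it.
# The RNG is copied before use, so the state passed in is left untouched and
# can be stored and replayed later.
function Base.iterate(b::ToyBatcher, state=(MersenneTwister(b.seed), 1))
    rng, n = state
    rng = copy(rng)
    batch = rand(rng, b.recordings, b.batch_size)  # sample with replacement
    return (n, batch), (rng, n + 1)
end

# Batches can be drawn indefinitely.
Base.IteratorSize(::Type{ToyBatcher}) = Base.IsInfinite()

batcher = ToyBatcher(["rec-a", "rec-b", "rec-c"], 2, 42)
item1, state1 = iterate(batcher)
item2, state2 = iterate(batcher, state1)

# Resuming from the saved state replays the same second batch.
item2_again, _ = iterate(batcher, state1)
```

Because the state is small and independent of the (potentially large) batch content, it could be checkpointed cheaply or sent to a worker process that materializes the batch, assuming the loading step is itself deterministic given the state.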
