An ad-hoc approach to acquiring and using data can seem simple at first, but leaves one ill-prepared to deal with questions like: "where did the data come from?", "how was the data processed?", or "am I looking at the same data?". Generic tools for managing data (including some in Julia) exist, but suffer from limitations that reduce their broad utility. DataToolkit.jl provides a highly extensible and integrated approach, making robust and reproducible treatment of data and results convenient.
In this talk, I will discuss the DataToolkit*.jl family of packages, which aims to enable end-to-end robust treatment of data. The small number of other projects that tackle subsets of data management (DataLad, the Kedro data catalogue, Snakemake, Nextflow, Intake, Pkg.jl's Artifacts, DataSets.jl) each contain good ideas, but all fall short of the convenience and robustness that is possible.
Poor data management practices are rampant. This has been particularly visible with computational pipelining tools, and so most have rolled their own data management systems (Snakemake, Nextflow, Kedro). These may work well when building/running computational pipelines, but are harder to use interactively; besides which, not everything is best expressed as a computational pipeline.
Scientific research is another area where robust data management is vitally important, yet research data often fails to follow one or more of the FAIR principles (findable, accessible, interoperable, reusable). For this domain, a more general approach is needed than computational pipeline tools offer. DataLad and Intake both represent more general solutions; however, both also fall short in major ways. DataLad lacks any sort of declarative data file (it embeds metadata as JSON in git commit messages), and while Intake has a data file format (YAML, and see https://noyaml.com for why that isn't a great choice) it only provides read-only data access, making it less helpful for writing and publishing results.
There is space for a new tool, providing a better approach to data management. Furthermore, there are strong arguments for building such a tool with Julia, not least the strong project management capabilities of Pkg.jl and the independence from system state provided by JLL packages. An analysis performed in Julia, with the environment specified by a Manifest.toml, the data processing captured within the same project, and the input data itself verified by checksumming, should provide strong reproducibility guarantees — beyond those easily offered by existing tools. A data analysis project can be hermetic.
One of the aims of DataToolkit.jl is to make this not just possible, but easy to achieve. This is done by providing a capable "Data CLI" for working with the Data.toml file. Obtaining and using a dataset is as easy as:
data> create iris https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv

julia> sum(d"iris".sepal_length) # do something with the data
This exercise generates a Data.toml file that declaratively specifies the data sets of a project, how they are obtained and verified, and even captures the preprocessing required before the main analysis. DataToolkit.jl provides a declarative data catalogue file, easy data access, and automatic validation; enabling data management in accordance with the FAIR principles.
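To give a feel for the declarative format, here is an illustrative sketch of what a Data.toml entry for the iris data set might look like. The exact keys, checksum scheme, and driver names shown are assumptions for illustration (the UUIDs and checksum value are placeholders), and may differ from what DataToolkit.jl actually generates:

```toml
# Illustrative sketch only — field names and values are assumed, not verbatim output.
data_config_version = 0
uuid = "00000000-0000-0000-0000-000000000000"  # placeholder collection UUID
name = "myproject"

[[iris]]
uuid = "00000000-0000-0000-0000-000000000001"  # placeholder data set UUID

    # Where the data comes from, and how it is verified.
    [[iris.storage]]
    driver = "web"
    url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
    checksum = "crc32c:00000000"  # placeholder checksum

    # How the raw bytes are turned into a usable Julia object.
    [[iris.loader]]
    driver = "csv"
```

The key point is that provenance (the URL), integrity (the checksum), and interpretation (the loader) all live together in one human-readable, version-controllable file, rather than being scattered across scripts and memory.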
Pkg.jl's Artifact system also provides these reproducibility guarantees, but is designed specifically for small blobs of data required by packages. This shows in its design, which carries a number of crippling caveats for general usage: a maximum file size of 2 GB, support only for libcurl downloads (i.e. no data stored in GCP or S3, etc.), and a lack of extensibility.
Even if DataToolkit.jl works well for common cases, to provide a truly general solution it needs to be able to adapt to unforeseen use cases — such as custom data formats and access methods, or integration with other tools. To this end, DataToolkit.jl also contains a highly flexible plugin framework that allows the fundamental behaviours of the system to be drastically altered. Several major components of DataToolkit.jl are in fact provided as plugins, such as the default value, logging, and memorisation systems.
While there is some overlap between DataToolkit.jl and DataSets.jl, DataToolkit.jl currently provides a superset of its capabilities (for instance, the ability to express composite data sets built from multiple input data sets), with more features in development; and the design of DataToolkitBase.jl allows for much more to be built on it.
The current plan for the talk itself is: