DataSets.jl: A bridge between code and data

07/29/2021, 12:30 PM - 12:40 PM UTC
Purple

Abstract:

In technical computing, getting data into and out of your code can be a pain. Data comes in all shapes, sizes and formats, with many different locations and storage access mechanisms.

DataSets.jl is a new package for describing data declaratively and mapping it neatly into your programs. We aim to make your code portable between data environments and remove the cruft of local paths and data access wrappers which litter technical analysis code.

Description:

DataSets.jl is an open source package for describing data format and location declaratively so that one can better separate data deserialization and access from the domain-specific analysis code which consumes that data.

To quote from the package documentation available at https://juliacomputing.github.io/DataSets.jl/dev :

DataSets.jl exists to help manage data and reduce the amount of data wrangling code you need to write. It's annoying to constantly rewrite:

  • Command line wrappers which deal with paths to data storage
  • Code to load and save from various data storage systems (e.g., local filesystem data, local git data, downloaders for remote data over various protocols, cloud storage access)
  • Code to load the same data model from various serializations
  • Code to deal with data lifecycle: versions, provenance, etc.

DataSets.jl provides scaffolding to make this kind of code more reusable. We want to make it easy to relocate an algorithm between different data environments without code changes: for example, from your laptop to the cloud, to another user's machine, or to an HPC system.
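The general idea of relocating an algorithm between data environments can be illustrated with a small, language-agnostic sketch. The Python below is not taken from the package documentation and is not the DataSets.jl API. The registry layout, the `LOADERS` table, and the names `open_dataset`, `word_count` and `demo` are all hypothetical, chosen only to show analysis code that refers to data by name while a separate declarative description supplies the location and format.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical loaders keyed by a declared "format" (illustration only,
# not part of DataSets.jl).
LOADERS = {
    "text": lambda path: Path(path).read_text(),
    "json": lambda path: json.loads(Path(path).read_text()),
}


def open_dataset(registry, name):
    """Resolve a dataset by name using its declarative description."""
    entry = registry[name]
    return LOADERS[entry["format"]](entry["path"])


def word_count(registry):
    """Analysis code: refers to the data by name only, never by path."""
    return len(open_dataset(registry, "notes").split())


def demo():
    """Run the same analysis against two differently configured environments."""
    with tempfile.TemporaryDirectory() as tmp:
        # Environment 1: a plain text file, as it might sit on a laptop.
        laptop_file = Path(tmp) / "laptop_notes.txt"
        laptop_file.write_text("three short words")
        laptop_env = {"notes": {"format": "text", "path": str(laptop_file)}}

        # Environment 2: the same logical dataset stored as a JSON string.
        cluster_file = Path(tmp) / "cluster_notes.json"
        cluster_file.write_text(json.dumps("five words stored as json"))
        cluster_env = {"notes": {"format": "json", "path": str(cluster_file)}}

        # word_count is unchanged; only the description differs.
        return word_count(laptop_env), word_count(cluster_env)


if __name__ == "__main__":
    print(demo())
```

In this sketch, moving `word_count` to a new environment means supplying a different registry, not editing the function. That is the separation the description above aims for, though the real package may achieve it quite differently.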

Platinum sponsors

Julia Computing

Gold sponsors

Relational AI

Silver sponsors

Invenia Labs
Conning
Pumas AI
QuEra Computing Inc.
King Abdullah University of Science and Technology
DataChef.co
Jeffrey Sarnoff

Media partners

Packt Publication

Fiscal Sponsor

NumFOCUS