The status of parquet in Julia

07/26/2023, 6:20 PM — 6:30 PM UTC
32-123

Abstract:

The parquet tabular data storage format has become one of the most ubiquitous formats, particularly in "big data" contexts, where it is arguably the only binary format to have successfully supplanted CSV. Despite this, there are relatively few implementations of parquet, which has historically presented challenges for Julia. I will give a brief overview of Parquet2.jl, a pure Julia parquet implementation, including a comparison to other tools and formats and a look at what is still needed to reach parity with pyarrow.

Description:

We will touch on the following:

  • Why did I write Parquet2.jl when Parquet.jl already existed?
  • Extremely quick overview of features.
  • Answering the often-asked question: which format should I use?
  • A very brief mention of some idiosyncrasies of the format, some challenges of testing against the JVM implementation, and why edge cases pop up.
  • What features are missing? How far is this from parity with the pyarrow implementation?
