The status of parquet in Julia

07/26/2023, 6:20 PM — 6:30 PM UTC

32-123

Abstract:

The parquet tabular data storage format has become one of the most ubiquitous formats of its kind, particularly in "big data" contexts, where it is arguably the only binary format to have successfully supplanted CSV. Despite this, there are relatively few implementations of parquet, which has historically presented challenges for Julia. I will give a brief overview of Parquet2.jl, a pure Julia parquet implementation, including a comparison to other tools and formats and a look at what is still needed to reach parity with pyarrow.

Description:

We will touch on the following:

Why did I write Parquet2.jl when Parquet.jl already existed?
Extremely quick overview of features.
Answering the often-asked question: which format should I use?
A very brief mention of some idiosyncrasies of the format, some challenges of testing against the JVM implementation, and why edge cases pop up.
What features are missing? How far is this from parity with the pyarrow implementation?
