I have a NetCDF4 file (and a CSV of the same data) that I would like to convert into a Zarr array and a partitioned Parquet dataset, respectively. I have used the documentation for dask.dataframe.to_parquet(...) and dask.array.to_zarr(...) to do the initial conversion and store the data in GCS, but the objective of this whole process is to benchmark read throughput and compare against TileDB Embedded (which supports writing both multi-dimensional arrays and columnar data).
In an attempt to get an "apples-to-apples" comparison, I would like to test the compression algorithms that all of these formats support at the same compression level (e.g. 1, 2, 3, …). My question is: do the Dask DataFrame and Array APIs support passing a compression level into the conversion functions, or does this need to be configured separately through PyArrow or Fastparquet?
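For reference, this is roughly what I am doing now (the bucket paths and array sizes are placeholders). On the Zarr side, passing a numcodecs compressor with an explicit level through to_zarr seems to work, since extra keyword arguments appear to be forwarded to zarr.create; on the Parquet side, I only see a `compression=` argument and I am not sure whether a level can be forwarded:

```python
# Minimal sketch of my current conversion calls; paths and sizes are placeholders.
import dask.array as da
import dask.dataframe as dd
from numcodecs import Blosc

# Zarr side: dask.array.to_zarr forwards extra kwargs to zarr.create,
# so a numcodecs compressor with an explicit level can be passed directly.
arr = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
arr.to_zarr(
    "gcs://my-bucket/benchmark/array.zarr",        # placeholder GCS path
    compressor=Blosc(cname="zstd", clevel=1),      # zstd at compression level 1
    overwrite=True,
)

# Parquet side: to_parquet exposes a `compression=` argument, but I don't see
# an explicit compression-level argument in the Dask signature -- is there a
# way to forward something like pyarrow's compression_level through kwargs?
df = dd.from_dask_array(arr.reshape(-1, 10), columns=[f"c{i}" for i in range(10)])
df.to_parquet(
    "gcs://my-bucket/benchmark/table.parquet",     # placeholder GCS path
    engine="pyarrow",
    compression="zstd",
    # compression_level=1,  # <- does this get passed through to the engine?
)
```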