Compression Levels while storing Dask DataFrame & Dask Array

I have a NetCDF4 file (and a CSV of the same data) that I would like to convert into a Zarr array and a partitioned Parquet dataset, respectively. I have used the documentation for dask.dataframe.to_parquet(...) and dask.array.to_zarr(...) to do the initial conversion and store the data in GCS, but the objective of this whole process is to benchmark read throughput and compare against TileDB Embedded (which supports writing both multi-dimensional arrays and columnar data).

In an attempt to get an “apples-to-apples” comparison, I would like to test the compression algorithms that both formats support at the same compression level (e.g. 1, 2, 3, …). My question is: do the Dask DataFrame and Array APIs support passing a compression level into the conversion functions, or does this need to be done separately in PyArrow or fastparquet?
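For reference, doing it “separately in PyArrow” would look roughly like the sketch below, where one chunk of the CSV is written with an explicit compression level (the file names are just placeholders, not my actual data):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# One partition's worth of data (placeholder file name)
df = pd.read_csv("data.csv")
table = pa.Table.from_pandas(df)

# PyArrow exposes both the codec and its level on write_table
pq.write_table(
    table,
    "part-0.parquet",
    compression="zstd",
    compression_level=1,
)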

@jgreen Welcome to Dask, good question! You can pass the compression settings directly through the write calls rather than through storage_options (which is reserved for filesystem settings, e.g. GCS credentials): da.to_zarr forwards extra keyword arguments such as compressor= to zarr when it creates the array, and dd.to_parquet with the pyarrow engine forwards extra keyword arguments such as compression_level= to PyArrow's Parquet writer.

Example:

from numcodecs import Blosc

import dask.array as da
import dask.dataframe as dd

# Small example DataFrame with two partitions
ddf = dd.from_dict(
    {
        "x": list(range(5)) * 2,
        "y": range(10, 20),
        "z": range(20, 30),
    },
    npartitions=2,
)

# Small example array: ten 1x1 chunks
s = da.ones((10, 1), chunks=(1, 1))

# Zarr: pass a numcodecs compressor; extra keyword arguments to to_zarr
# are forwarded to zarr when the array is created
compressor = Blosc(cname="zstd", clevel=1)
da.to_zarr(s, "store_array", compressor=compressor)

# Parquet (pyarrow engine): "compression" selects the codec and
# "compression_level" is forwarded to PyArrow's writer
dd.to_parquet(
    ddf,
    "store_dataframe",
    compression="zstd",
    compression_level=1,
)
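To sanity-check that the settings were applied, you can open the stores back up and inspect them. A quick sketch, assuming the local paths above and dask's default part.0.parquet file naming (adjust the path if your output files are named differently):

import zarr
import pyarrow.parquet as pq

# The zarr array records the compressor it was written with
z = zarr.open("store_array", mode="r")
print(z.compressor)  # e.g. Blosc(cname='zstd', clevel=1, ...)

# The parquet metadata records the codec used for each column chunk
md = pq.ParquetFile("store_dataframe/part.0.parquet").metadata
print(md.row_group(0).column(0).compression)  # e.g. ZSTD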

Some relevant docs: the API references for dask.array.to_zarr and dask.dataframe.to_parquet, and the numcodecs Blosc documentation.


Thanks for the reply, this is exactly what I was looking for!
