Compression Levels while storing Dask DataFrame & Dask Array

I have a NetCDF4 file (and a CSV of the same data) that I would like to convert into a Zarr array and a partitioned Parquet dataset, respectively. I have used the documentation for dask.dataframe.to_parquet(...) and dask.array.to_zarr(...) to do the initial conversion and store the data in GCS, but the objective of this whole process is to benchmark read throughput and compare against TileDB Embedded (which supports writing both multi-dimensional arrays and columnar data).

In an attempt to get an “apples-to-apples” comparison, I would like to test the compression algorithms that both formats support at the same compression level (e.g. 1, 2, 3, …). My question is: do the Dask DataFrame and Array APIs support passing a compression level into the conversion functions, or does this need to be done separately in PyArrow or fastparquet?
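For reference, doing it “separately in PyArrow” would look roughly like the sketch below, where one chunk of the CSV is written with an explicit compression level (the file names are just placeholders, not my actual data):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# One partition's worth of data (placeholder file name)
df = pd.read_csv("data.csv")
table = pa.Table.from_pandas(df)

# PyArrow exposes both the codec and its level on write_table
pq.write_table(
    table,
    "part-0.parquet",
    compression="zstd",
    compression_level=1,
)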

@jgreen Welcome to Dask, good question! You can pass the compression settings directly through the write calls rather than through storage_options (which is reserved for filesystem settings, e.g. GCS credentials): da.to_zarr forwards extra keyword arguments such as compressor= to zarr when it creates the array, and dd.to_parquet with the pyarrow engine forwards extra keyword arguments such as compression_level= to PyArrow's Parquet writer.

Example:

from numcodecs import Blosc

import dask.array as da
import dask.dataframe as dd

# Small example DataFrame with two partitions
ddf = dd.from_dict(
    {
        "x": list(range(5)) * 2,
        "y": range(10, 20),
        "z": range(20, 30),
    },
    npartitions=2,
)

# Small example array: ten 1x1 chunks
s = da.ones((10, 1), chunks=(1, 1))

# Zarr: pass a numcodecs compressor; extra keyword arguments to to_zarr
# are forwarded to zarr when the array is created
compressor = Blosc(cname="zstd", clevel=1)
da.to_zarr(s, "store_array", compressor=compressor)

# Parquet (pyarrow engine): "compression" selects the codec and
# "compression_level" is forwarded to PyArrow's writer
dd.to_parquet(
    ddf,
    "store_dataframe",
    compression="zstd",
    compression_level=1,
)
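To sanity-check that the settings were applied, you can open the stores back up and inspect them. A quick sketch, assuming the local paths above and dask's default part.0.parquet file naming (adjust the path if your output files are named differently):

import zarr
import pyarrow.parquet as pq

# The zarr array records the compressor it was written with
z = zarr.open("store_array", mode="r")
print(z.compressor)  # e.g. Blosc(cname='zstd', clevel=1, ...)

# The parquet metadata records the codec used for each column chunk
md = pq.ParquetFile("store_dataframe/part.0.parquet").metadata
print(md.row_group(0).column(0).compression)  # e.g. ZSTD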

Some relevant docs: the API references for dask.array.to_zarr and dask.dataframe.to_parquet, and the numcodecs Blosc documentation.


Thanks for the reply, this is exactly what I was looking for!
