How to upload dataframe with numpy array column using to_parquet in dask.dataframe?

Hi @hjlee9182, welcome to Dask Discourse forum!

As indicated in the `to_parquet` documentation for the `schema` kwarg:

> Global schema to use for the output dataset. Defaults to “infer”, which will infer the schema from the dask dataframe metadata. This is usually sufficient for common schemas, but notably will fail for object dtype columns that contain things other than strings. These columns will require an explicit schema be specified.

So you need to specify a schema explicitly in `to_parquet`. I’m no pyarrow expert, but I’ve been able to make it work with:

import pyarrow as pa

df.to_parquet('/tmp/arrayparquet', engine='pyarrow', schema={"float_array_column": pa.list_(pa.float64())})