How to upload a dataframe with a numpy array column using to_parquet in dask.dataframe?

I want to upload a dataframe that has an array column.
The code below raises a type error.
Any suggestions on how to further troubleshoot, how to fix this, or where else I could ask for help would be greatly appreciated!

import numpy as np
import pandas as pd
from dask import dataframe as dd  # dask version == 2023.5.0

data = {
    'float_array_column': [
        np.array([1.1, 2.2, 3.3]),
        np.array([4.4, 5.5]),
        np.array([6.6, 7.7, 8.8, 9.9])
    ]
}
path = {storage_path}
storage_option = {storage_option}

df = pd.DataFrame(data)
df = dd.from_pandas(df, chunksize=5368709)

dd.to_parquet(df, path, engine='pyarrow', storage_options=storage_option)

Changing the data from numpy arrays to plain lists also results in an error.

data = {
    'float_array_column': [
        [1.1, 2.2, 3.3],
        [4.4, 5.5],
        [6.6, 7.7, 8.8, 9.9]
    ]
}
ValueError: Failed to convert partition to expected pyarrow schema:
    `ArrowTypeError("Expected bytes, got a 'list' object", 'Conversion failed for column float_array_column with type object')`

Expected partition schema:
    float_array_column: string
    __null_dask_index__: int64

Received partition schema:
    float_array_column: list<item: double>
      child 0, item: double
    __null_dask_index__: int64

This error *may* be resolved by passing in schema information for
the mismatched column(s) using the `schema` keyword in `to_parquet`.

Using dd.from_array also results in an error.

data = {
    'float_array_column': [
        dd.from_array(np.array([1.1, 2.2, 3.3])),
        dd.from_array(np.array([4.4, 5.5])),
        dd.from_array(np.array([6.6, 7.7, 8.8, 9.9]))
    ]
}
ArrowTypeError: ("Expected bytes, got a 'Series' object", 'Conversion failed for column float_array_column with type object')

During handling of the above exception, another exception occurred:

Hi @hjlee9182, welcome to the Dask Discourse forum!

As indicated in the documentation for the schema kwarg:

Global schema to use for the output dataset. Defaults to “infer”, which will infer the schema from the dask dataframe metadata. This is usually sufficient for common schemas, but notably will fail for object dtype columns that contain things other than strings. These columns will require an explicit schema be specified.

So you need to specify a schema in to_parquet. I’m no pyarrow expert, but I’ve been able to make it work with:

import pyarrow as pa

df.to_parquet('/tmp/arrayparquet', engine='pyarrow', schema={"float_array_column": pa.list_(pa.float64())})
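
For reference, here is a minimal end-to-end sketch of what I mean (the local path /tmp/arrayparquet is just an example, swap in your own path and storage_options); reading the dataset back shows the column round-trips as lists of floats:

import numpy as np
import pandas as pd
import pyarrow as pa
from dask import dataframe as dd

data = {
    'float_array_column': [
        np.array([1.1, 2.2, 3.3]),
        np.array([4.4, 5.5]),
        np.array([6.6, 7.7, 8.8, 9.9]),
    ]
}

df = dd.from_pandas(pd.DataFrame(data), npartitions=1)

# Give pyarrow an explicit schema for the object-dtype column of
# variable-length arrays; other columns are still inferred.
df.to_parquet(
    '/tmp/arrayparquet',
    engine='pyarrow',
    schema={'float_array_column': pa.list_(pa.float64())},
)

# Read it back: each row of the column comes back as a list of floats.
print(dd.read_parquet('/tmp/arrayparquet', engine='pyarrow').compute())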

Thanks! That’s very helpful!
