How to upload a dataframe with a numpy array column using to_parquet in dask.dataframe?

I want to upload a dataframe that has an array column.
The code below raises a type error.
Any suggestions on how to further troubleshoot, how to fix this, or where else I could ask for help would be greatly appreciated!

import numpy as np
import pandas as pd
from dask import dataframe as dd  # dask version == 2023.5.0

data = {
    'float_array_column': [
        np.array([1.1, 2.2, 3.3]),
        np.array([4.4, 5.5]),
        np.array([6.6, 7.7, 8.8, 9.9])
    ]
}
path = {storage_path}
storage_option = {storage_option}

df = pd.DataFrame(data)
df = dd.from_pandas(df, chunksize=5368709)

dd.to_parquet(df, path, engine='pyarrow', storage_options=storage_option)

Changing the data from numpy arrays to plain lists also results in an error.

data = {
    'float_array_column': [
        [1.1, 2.2, 3.3],
        [4.4, 5.5],
        [6.6, 7.7, 8.8, 9.9]
    ]
}
ValueError: Failed to convert partition to expected pyarrow schema:
    `ArrowTypeError("Expected bytes, got a 'list' object", 'Conversion failed for column float_array_column with type object')`

Expected partition schema:
    float_array_column: string
    __null_dask_index__: int64

Received partition schema:
    float_array_column: list<item: double>
      child 0, item: double
    __null_dask_index__: int64

This error *may* be resolved by passing in schema information for
the mismatched column(s) using the `schema` keyword in `to_parquet`.

Using dd.from_array also results in an error.

data = {
    'float_array_column': [
        dd.from_array(np.array([1.1, 2.2, 3.3])),
        dd.from_array(np.array([4.4, 5.5])),
        dd.from_array(np.array([6.6, 7.7, 8.8, 9.9]))
    ]
}
ArrowTypeError: ("Expected bytes, got a 'Series' object", 'Conversion failed for column float_array_column with type object')

During handling of the above exception, another exception occurred:

Hi @hjlee9182, welcome to the Dask Discourse forum!

As indicated in the documentation for the schema kwarg:

Global schema to use for the output dataset. Defaults to “infer”, which will infer the schema from the dask dataframe metadata. This is usually sufficient for common schemas, but notably will fail for object dtype columns that contain things other than strings. These columns will require an explicit schema be specified.

So you need to specify a schema in to_parquet. I’m no pyarrow expert, but I’ve been able to make it work with:

import pyarrow as pa

df.to_parquet('/tmp/arrayparquet', engine='pyarrow', schema={"float_array_column": pa.list_(pa.float64())})
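
For reference, here is a minimal end-to-end sketch of what I mean (the local path /tmp/arrayparquet is just an example, swap in your own path and storage_options); reading the dataset back shows the column round-trips as lists of floats:

import numpy as np
import pandas as pd
import pyarrow as pa
from dask import dataframe as dd

data = {
    'float_array_column': [
        np.array([1.1, 2.2, 3.3]),
        np.array([4.4, 5.5]),
        np.array([6.6, 7.7, 8.8, 9.9]),
    ]
}

df = dd.from_pandas(pd.DataFrame(data), npartitions=1)

# Give pyarrow an explicit schema for the object-dtype column of
# variable-length arrays; other columns are still inferred.
df.to_parquet(
    '/tmp/arrayparquet',
    engine='pyarrow',
    schema={'float_array_column': pa.list_(pa.float64())},
)

# Read it back: each row of the column comes back as a list of floats.
print(dd.read_parquet('/tmp/arrayparquet', engine='pyarrow').compute())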

Thanks! That’s very helpful!
