Dask .to_parquet() errors when saving lists of integers (object dtype) with convert-string: False

Hi,

I am currently working with data that has lists of integers as column values. We work with parquet files extensively, and I am facing an issue when trying to read and write these files with dask.
With convert-string set to True there is no issue, because dask converts these values to strings, but I want to be able to work with the data without having to parse it back into its proper format every time.
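For illustration, this is the kind of parse-back step I would like to avoid (a minimal sketch, assuming the lists come back as their string representation such as "[1, 2, 3]"; the column name and values are just examples):

import ast

import pandas as pd

# Hypothetical post-processing: turn stringified lists back into
# real Python lists after reading the data.
df = pd.DataFrame({"a": ["[1, 2, 3]", "[4, 5]"]})
df["a"] = df["a"].map(ast.literal_eval)
print(df["a"].iloc[0])  # [1, 2, 3]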

Code to reproduce:

import dask
import dask.dataframe as dd
import pandas as pd

dask.config.set({"dataframe.convert-string": False})

a = pd.DataFrame({"a": [[1, 2, 3]], "b": [[1, 2, 3]]})
ddf = dd.from_pandas(a, npartitions=1)

ddf.to_parquet("testdd.parquet")

When calling to_parquet() I get:

ValueError: Failed to convert partition to expected pyarrow schema:
    `ArrowTypeError("Expected bytes, got a 'list' object", 'Conversion failed for column a with type object')`

Expected partition schema:
    a: string
    b: string
    __null_dask_index__: int64

Received partition schema:
    a: list<item: int64>
      child 0, item: int64
    b: list<item: int64>
      child 0, item: int64
    __null_dask_index__: int64

This error *may* be resolved by passing in schema information for
the mismatched column(s) using the `schema` keyword in `to_parquet`.

I noticed that support for these dtypes is fairly new:
https://docs.dask.org/en/stable/changelog.html#dtype-inference-in-read-parquet

It makes me wonder whether I am missing something, or whether .to_parquet() does not fully support these dtypes and I have to fall back to pandas for this.

Hi,

The Related section helped me find a solution.

In case somebody else struggles with this, the following works just fine:

import dask
import dask.dataframe as dd
import pandas as pd
import pyarrow as pa

dask.config.set({"dataframe.convert-string": False})

a = pd.DataFrame({"int_array_column1": [[1, 2, 3]], "int_array_column2": [[1, 2, 3]]})
ddf = dd.from_pandas(a, npartitions=1)

# Passing an explicit pyarrow type for each list column avoids the
# schema-inference mismatch.
ddf.to_parquet(
    "testdd.parquet",
    schema={
        "int_array_column1": pa.list_(pa.int64()),
        "int_array_column2": pa.list_(pa.int64()),
    },
)
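To double-check the fix, the file can be read back and the lists inspected (a quick sketch using the same file name as above; the exact dtypes shown may differ by dask version):

import dask.dataframe as dd

# Read the file back and confirm the list values survived the round trip.
ddf_back = dd.read_parquet("testdd.parquet")
print(ddf_back.dtypes)
print(ddf_back.compute()["int_array_column1"].iloc[0])  # [1, 2, 3]

A full pyarrow.Schema passed to the same schema keyword should work as well.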

Please feel free to close the thread. Apologies for the mess; hopefully somebody can use this.
