Hi,
I am currently working with data where some columns contain lists of integers as values. We work with Parquet files extensively, and I am running into an issue when trying to read and write these files using Dask.
With convert-string set to True there is no error, because Dask converts these columns to strings, but I want to be able to work with the data without having to parse it back into the proper format every time.
Code to reproduce:
import dask
import dask.dataframe as dd
import pandas as pd

# Keep object dtypes instead of converting them to pyarrow strings
dask.config.set({"dataframe.convert-string": False})

a = pd.DataFrame({"a": [[1, 2, 3]], "b": [[1, 2, 3]]})
ddf = dd.from_pandas(a, npartitions=1)
ddf.to_parquet("testdd.parquet")
When calling to_parquet() I get:
ValueError: Failed to convert partition to expected pyarrow schema:
`ArrowTypeError("Expected bytes, got a 'list' object", 'Conversion failed for column a with type object')`
Expected partition schema:
a: string
b: string
__null_dask_index__: int64
Received partition schema:
a: list<item: int64>
  child 0, item: int64
b: list<item: int64>
  child 0, item: int64
__null_dask_index__: int64
This error *may* be resolved by passing in schema information for
the mismatched column(s) using the `schema` keyword in `to_parquet`.
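If I follow that suggestion and pass an explicit pyarrow type for the mismatched columns, I assume something like this sketch is what is intended (I am using the dict form of the schema keyword, which as far as I understand overrides only the listed columns and leaves the rest to be inferred):

import pyarrow as pa

# Tell the pyarrow engine explicitly that these columns hold lists of int64,
# instead of letting Dask infer `string` from the object-dtype metadata
ddf.to_parquet(
    "testdd.parquet",
    schema={"a": pa.list_(pa.int64()), "b": pa.list_(pa.int64())},
)

But having to spell out the schema by hand for every list column is exactly the kind of manual step I was hoping to avoid.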
I noticed that support for these dtypes is fairly new:
https://docs.dask.org/en/stable/changelog.html#dtype-inference-in-read-parquet
This makes me wonder whether I am missing something, or whether .to_parquet() does not fully support these dtypes and I have to fall back to pandas for this?
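For reference, my understanding is that writing the same frame directly with pandas works, because pyarrow infers the list type on its own; a minimal sketch (test_pandas.parquet is just a placeholder path):

import pandas as pd

a = pd.DataFrame({"a": [[1, 2, 3]], "b": [[1, 2, 3]]})
# pyarrow infers list<item: int64> for both columns, no explicit schema needed
a.to_parquet("test_pandas.parquet", engine="pyarrow")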