Dask .to_parquet() errors when saving lists of integers (object dtype) with convert-string: False

Hi,

I am currently working with data that has lists of integers as column values. We work with parquet files extensively, and I am facing an issue when trying to read and write these files with dask.
With convert-string set to True there is no issue, because dask converts these values to strings, but I want to be able to work with the data without having to parse it back into its proper format every time.
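For illustration, this is the kind of parse-back step I would like to avoid (a minimal sketch, assuming the lists come back as their string representation such as "[1, 2, 3]"; the column name and values are just examples):

import ast

import pandas as pd

# Hypothetical post-processing: turn stringified lists back into
# real Python lists after reading the data.
df = pd.DataFrame({"a": ["[1, 2, 3]", "[4, 5]"]})
df["a"] = df["a"].map(ast.literal_eval)
print(df["a"].iloc[0])  # [1, 2, 3]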

Code to reproduce:

import dask
import dask.dataframe as dd
import pandas as pd

dask.config.set({"dataframe.convert-string": False})

a = pd.DataFrame({"a": [[1, 2, 3]], "b": [[1, 2, 3]]})
ddf = dd.from_pandas(a, npartitions=1)

ddf.to_parquet("testdd.parquet")

When calling to_parquet() I get:

ValueError: Failed to convert partition to expected pyarrow schema:
    `ArrowTypeError("Expected bytes, got a 'list' object", 'Conversion failed for column a with type object')`

Expected partition schema:
    a: string
    b: string
    __null_dask_index__: int64

Received partition schema:
    a: list<item: int64>
      child 0, item: int64
    b: list<item: int64>
      child 0, item: int64
    __null_dask_index__: int64

This error *may* be resolved by passing in schema information for
the mismatched column(s) using the `schema` keyword in `to_parquet`.

I noticed that support for these dtypes is fairly new:
https://docs.dask.org/en/stable/changelog.html#dtype-inference-in-read-parquet

It makes me wonder whether I am missing something, or whether .to_parquet() does not fully support these dtypes and I have to fall back to pandas for this.

Hi,

The Related section helped me find a solution.

In case somebody else struggles with this, the following works just fine:

import dask
import dask.dataframe as dd
import pandas as pd
import pyarrow as pa

dask.config.set({"dataframe.convert-string": False})

a = pd.DataFrame({"int_array_column1": [[1, 2, 3]], "int_array_column2": [[1, 2, 3]]})
ddf = dd.from_pandas(a, npartitions=1)

# Passing an explicit pyarrow type for each list column avoids the
# schema-inference mismatch.
ddf.to_parquet(
    "testdd.parquet",
    schema={
        "int_array_column1": pa.list_(pa.int64()),
        "int_array_column2": pa.list_(pa.int64()),
    },
)
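To double-check the fix, the file can be read back and the lists inspected (a quick sketch using the same file name as above; the exact dtypes shown may differ by dask version):

import dask.dataframe as dd

# Read the file back and confirm the list values survived the round trip.
ddf_back = dd.read_parquet("testdd.parquet")
print(ddf_back.dtypes)
print(ddf_back.compute()["int_array_column1"].iloc[0])  # [1, 2, 3]

A full pyarrow.Schema passed to the same schema keyword should work as well.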

Please feel free to close the thread. Apologies for the mess; hopefully somebody can use this.
