Error when creating pyarrow schema from dask dataframe

Hello everyone,

We are trying to build a pyarrow schema from a Dask DataFrame in order to write it to Parquet, but we're getting an error.

We’re currently doing something like the code below to do this:

schema = pa.Schema.from_pandas(make_meta(df), preserve_index=False)

The problem seems to be that we have a column with many categories (500+), but its schema is being set to dictionary<values=string, indices=int8, ordered=0>.

So we get the error below:

ValueError: ('Long error message', "Failed to convert partition to expected pyarrow schema:
    ArrowInvalid('Integer value 552 not in range: -128 to 127', 'Conversion failed for column X with type category')")

Are we doing something wrong to get the pyarrow schema? Is there a better way to do it?

We are using:

  • pyarrow 10.0.1
  • dask 2023.1.0

Thanks

Milton

Hi @miltava,

Full disclosure: I’ve never worked with pyarrow, so we might want help from someone who knows it better than me at some point.

In the meantime, it is a bit difficult to help without a minimal reproducer; do you think you could make one? At the very least, could you share a few more lines of code: do you get the error while processing the Dask DataFrame, or before?

I’ve got two other questions, though (they may be completely useless):

  • Why are you using make_meta when constructing the pyarrow schema? Couldn’t you build it from a Pandas DataFrame directly?
  • Or is it because you’re working from a Dask DataFrame to build the schema (which is what you’re saying :smile:)? In that case, it’s probably difficult to guess the correct column schema from only a small sample of the values.
  • Isn’t it possible to modify the schema you get and replace the dictionary<values=string, indices=int8, ordered=0> with dictionary<values=string, indices=int16, ordered=0>?
  • Are you manually creating a pyarrow schema because the default behavior of to_parquet is failing to infer the correct schema?