Error when creating pyarrow schema from dask dataframe

Hello everyone,

We are trying to build a pyarrow schema from a Dask DataFrame in order to write it to Parquet, but we're getting an error.

We’re currently doing something like the code below:

import pyarrow as pa
from dask.dataframe.utils import make_meta

schema = pa.Schema.from_pandas(make_meta(df), preserve_index=False)

The problem seems to be that we have a column with many categories (500+), but its schema is being set to dictionary<values=string, indices=int8, ordered=0>.

So we get the error below:

ValueError: ('Long error message', "Failed to convert partition to expected pyarrow schema:\n    `ArrowInvalid('Integer value 552 not in range: -128 to 127', 'Conversion failed for column X with type category')`

Are we doing something wrong to get the pyarrow schema? Is there a better way to do it?

We are using:

  • pyarrow 10.0.1
  • dask 2023.1.0

Thanks

Milton

Hi @miltava,

Full disclosure: I’ve never worked with pyarrow, so we might need input from someone who knows it better than I do at some point.

In the meantime, it is a bit difficult to help without a minimal reproducer; do you think you could make one? At the very least, could you share a few more lines of code? Do you get the error while processing the Dask DataFrame, or before?

I’ve got a few other questions though (which may be completely useless):

  • Why are you using make_meta when constructing the pyarrow schema? Couldn’t you build it from a pandas DataFrame only?
  • Or is this because you’re working from a Dask DataFrame to build the schema (this is what you’re saying :smile:)? In this case, it’s probably difficult to guess the correct column schema from only a small sample of the values.
  • Isn’t it possible to modify the schema you get and replace the dictionary<values=string, indices=int8, ordered=0> with dictionary<values=string, indices=int16, ordered=0>? (See the sketch after this list.)
  • Are you manually creating a pyarrow schema because the default behavior of to_parquet is failing to infer the correct schema?
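
On that third point, I’m not sure it’s the right approach, but something like the sketch below might work. It is untested, and it assumes your schema really comes from pa.Schema.from_pandas(make_meta(df), ...) as in your snippet, with X standing in for the problematic column from the error message:

import pyarrow as pa
from dask.dataframe.utils import make_meta

# Build the schema as in the original snippet...
schema = pa.Schema.from_pandas(make_meta(df), preserve_index=False)

# ...then widen the dictionary indices of the offending column.
# "X" is just the column name taken from the error message.
i = schema.get_field_index("X")
field = schema.field(i)
if pa.types.is_dictionary(field.type):
    wider = pa.dictionary(pa.int32(), field.type.value_type)
    schema = schema.set(i, pa.field(field.name, wider))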

I had the same problem. In my case it happened when loading partitioned Parquet files, and I think that some partitions had few enough unique values for the category codes to fit in int8 (roughly fewer than 128) while others had more. This then failed inside to_parquet with:

    dask.multiprocessing.ValueError: Failed to convert partition to expected pyarrow schema:
        `ArrowInvalid('Integer value 128 not in range: -128 to 127', 'Conversion failed for column device_id with type category')`

    Expected partition schema:
        device_id: dictionary<values=string, indices=int8, ordered=0>
        ...
    Received partition schema:
        device_id: dictionary<values=string, indices=int16, ordered=0>
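
I believe the underlying reason is that pandas picks the smallest integer dtype that fits the number of categories when it creates the category codes, so partitions with different numbers of unique values end up with different dictionary index types once converted to pyarrow. A quick illustration (made-up values, just to show the dtype difference):

import pandas as pd

# Few categories -> int8 codes; many categories -> wider codes.
small = pd.Series(["a", "b", "c"], dtype="category")
large = pd.Series([str(i) for i in range(500)], dtype="category")
print(small.cat.codes.dtype)   # int8
print(large.cat.codes.dtype)   # int16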

I resolved this by manually specifying the schema parameter of to_parquet(), so that the affected columns use a larger index size for the pyarrow dictionaries.

import pyarrow

schema = {
    'device_id': pyarrow.dictionary(pyarrow.int32(), pyarrow.string()),
    'date': pyarrow.dictionary(pyarrow.int32(), pyarrow.string()),
}
to_parquet(...., schema=schema)

Hope this helps someone, as I could not find much else on this particular error online :slight_smile: