The problem seems to be that we have a column with many categories (500+), but its schema is being set to dictionary<values=string, indices=int8, ordered=0>.
So we get the error below:
ValueError: ('Long error message', "Failed to convert partition to expected pyarrow schema:\n `ArrowInvalid('Integer value 552 not in range: -128 to 127', 'Conversion failed for column X with type category')`
Are we doing something wrong to get the pyarrow schema? Is there a better way to do it?
OK, clearly, I’ve never worked with pyarrow, so we might want some help from someone who knows this better than me at some point.
In the meantime, it is a bit difficult to help without a minimal reproducer; do you think you could make one? At the very least, could you give a few more lines of code: do you get the error when processing a Dask DataFrame, or before?
I’ve got a few other questions though (that may be completely useless):
Why are you using make_meta when constructing the pyarrow schema? Couldn’t you build it from the pandas DataFrame only?
Or is this because you’re working from a Dask DataFrame to build the schema (this is what you’re saying)? In that case, it’s probably difficult to guess the correct column schema from only a small sample of the values.
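For what it’s worth, here is a minimal sketch of what I mean by building the schema from a concrete pandas DataFrame rather than from the empty meta (the column name X and the 500 placeholder categories are just assumptions to mirror your case):

```python
import pandas as pd
import pyarrow as pa

# With real values present, pandas picks integer codes wide enough for
# all 500 categories, and pyarrow carries that width into the dictionary type.
pdf = pd.DataFrame({"X": pd.Categorical([f"cat{i}" for i in range(500)])})
schema = pa.Schema.from_pandas(pdf)
print(schema.field("X").type)
# e.g. dictionary<values=string, indices=int16, ordered=0>
```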
Isn’t it possible to modify the schema you get and replace the dictionary<values=string, indices=int8, ordered=0> with dictionary<values=string, indices=int16, ordered=0>?
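Something along these lines, for example (just a sketch, assuming the schema you already have and that the offending column is called X):

```python
import pyarrow as pa

# Hypothetical starting point: the schema inferred from the empty meta,
# where X ended up with int8 dictionary indices.
schema = pa.schema([pa.field("X", pa.dictionary(pa.int8(), pa.string()))])

# Swap in wider int16 indices for that column.
idx = schema.get_field_index("X")
schema = schema.set(idx, pa.field("X", pa.dictionary(pa.int16(), pa.string())))
print(schema)
```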
Are you manually creating a pyarrow schema because the default behavior of to_parquet is failing to infer the correct schema?
I had the same problem. In my case it happened when loading partitioned Parquet files, and I think that some partitions had few enough unique values for int8 dictionary indices while others had more. This then failed inside to_parquet with
dask.multiprocessing.ValueError: Failed to convert partition to expected pyarrow schema:
`ArrowInvalid('Integer value 128 not in range: -128 to 127', 'Conversion failed for column device_id with type category')`
Expected partition schema:
device_id: dictionary<values=string, indices=int8, ordered=0>
...
Received partition schema:
device_id: dictionary<values=string, indices=int16, ordered=0>
I resolved this by manually specifying the schema parameter of to_parquet(), so that the affected columns use a larger index type for the pyarrow dictionaries.
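Roughly like the sketch below; the paths, the device_id column, and the choice of int32 indices are from my case, and this assumes a Dask version whose to_parquet schema argument accepts a dict of column names to pyarrow types:

```python
import pyarrow as pa
import dask.dataframe as dd

ddf = dd.read_parquet("input_parquet/", engine="pyarrow")

# Pin a wider dictionary index type for the categorical column so every
# partition is written with the same schema; other columns are inferred.
ddf.to_parquet(
    "output_parquet/",
    engine="pyarrow",
    schema={"device_id": pa.dictionary(pa.int32(), pa.string())},
)
```

With the index type pinned for that column, every partition gets cast to the same dictionary type instead of whatever width its own number of unique values happened to produce.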