Hello everyone,
We are trying to build a pyarrow schema from a Dask DataFrame in order to write it to Parquet, but we’re getting an error.
We’re currently doing something like this:
```python
schema = pa.Schema.from_pandas(make_meta(df), preserve_index=False)
```
The problem seems to be that we have a column with many categories (500+), but its schema is being set to `dictionary<values=string, indices=int8, ordered=0>`.
So we get the error below:
```
ValueError: ('Long error message', "Failed to convert partition to expected pyarrow schema:\n `ArrowInvalid('Integer value 552 not in range: -128 to 127', 'Conversion failed for column X with type category')`
```
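For illustration, here is a small standalone snippet (the column name and categories are made up) that produces the same `indices=int8` dictionary type from an empty DataFrame, which is roughly what `make_meta` gives us:

```python
import pandas as pd
import pyarrow as pa

# Stand-in for make_meta(df): an empty DataFrame whose categorical column
# only knows about a handful of categories, so pandas stores its codes as int8.
dtype = pd.CategoricalDtype(categories=["a", "b", "c"])
meta = pd.DataFrame({"X": pd.Series([], dtype=dtype)})

schema = pa.Schema.from_pandas(meta, preserve_index=False)
print(schema.field("X").type)
# dictionary<values=string, indices=int8, ordered=0>
```

With 500+ categories in the real data, those int8 indices (range -128 to 127) overflow when the actual partitions are converted, which matches the error above.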
Are we doing something wrong when building the pyarrow schema? Is there a better way to do it?
We are using:
- pyarrow 10.0.1
- dask 2023.1.0
Thanks
Milton
Hi @miltava,
First of all, I’ve never worked with pyarrow myself, so we might want some help from someone who knows it better than me at some point.
In the meantime, it is a bit difficult to help without a minimal reproducer; do you think you could put one together? At the very least, could you share a few more lines of code? Do you get the error while processing the Dask DataFrame, or before that?
I’ve got a few other questions though (they may be completely useless):
- Why are you using `make_meta` when constructing the pyarrow schema? Couldn’t you build it from the Pandas DataFrame only?
- Or is this because you’re building the schema from a Dask DataFrame (which is what you’re saying)? In that case, it’s probably difficult to guess the correct column schema from only a small sample of the values.
- Isn’t it possible to modify the schema you get and replace `dictionary<values=string, indices=int8, ordered=0>` with `dictionary<values=string, indices=int16, ordered=0>`? (See the sketch after these questions.)
- Are you manually creating a pyarrow schema because the default behavior of `to_parquet` is failing to infer the correct one?
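For the third point, something along these lines is what I had in mind (untested sketch; it reuses the `schema` and `df` from your snippet, assumes the problematic column is the `X` from your error message, and uses a placeholder output path):

```python
import pyarrow as pa

# Swap the int8 dictionary index type for int16 on the offending column.
idx = schema.get_field_index("X")
schema = schema.set(idx, pa.field("X", pa.dictionary(pa.int16(), pa.string())))

# The patched schema could then be passed explicitly to to_parquet
# instead of relying on the inferred one.
df.to_parquet("output_path/", schema=schema)
```

int16 indices cover up to 32,767 categories, which should be more than enough for the 500+ you mention.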