Error when creating pyarrow schema from dask dataframe

Hello everyone,

We are trying to build a pyarrow schema from a Dask DataFrame in order to write it to Parquet, but we're getting an error.

We’re currently doing something like the code below:

import pyarrow as pa
from dask.dataframe.utils import make_meta

schema = pa.Schema.from_pandas(make_meta(df), preserve_index=False)

The problem seems to be that we have a column with many categories (500+), but its schema is being set to dictionary<values=string, indices=int8, ordered=0>.

So we get the error below:

ValueError: ('Long error message', "Failed to convert partition to expected pyarrow schema:\n    `ArrowInvalid('Integer value 552 not in range: -128 to 127', 'Conversion failed for column X with type category')`

Are we doing something wrong to get the pyarrow schema? Is there a better way to do it?

We are using:

  • pyarrow 10.0.1
  • dask 2023.1.0

Thanks

Milton

Hi @miltava,

Full disclosure: I’ve never worked with pyarrow, so we might need input from someone who knows it better than I do at some point.

In the meantime, it is a bit difficult to help without a minimal reproducer; do you think you could make one? At the very least, could you share a few more lines of code? Do you get the error while processing the Dask DataFrame, or before?

I’ve got a few other questions though (which may be completely useless):

  • Why are you using make_meta when constructing the pyarrow schema? Couldn’t you build it from a pandas DataFrame only?
  • Or is this because you’re working from a Dask DataFrame to build the schema (this is what you’re saying :smile:)? In this case, it’s probably difficult to guess the correct column schema from only a small sample of the values.
  • Isn’t it possible to modify the schema you get and replace the dictionary<values=string, indices=int8, ordered=0> with dictionary<values=string, indices=int16, ordered=0>? (See the sketch after this list.)
  • Are you manually creating a pyarrow schema because the default behavior of to_parquet is failing to infer the correct schema?
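
On that third point, I’m not sure it’s the right approach, but something like the sketch below might work. It is untested, and it assumes your schema really comes from pa.Schema.from_pandas(make_meta(df), ...) as in your snippet, with X standing in for the problematic column from the error message:

import pyarrow as pa
from dask.dataframe.utils import make_meta

# Build the schema as in the original snippet...
schema = pa.Schema.from_pandas(make_meta(df), preserve_index=False)

# ...then widen the dictionary indices of the offending column.
# "X" is just the column name taken from the error message.
i = schema.get_field_index("X")
field = schema.field(i)
if pa.types.is_dictionary(field.type):
    wider = pa.dictionary(pa.int32(), field.type.value_type)
    schema = schema.set(i, pa.field(field.name, wider))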

I had the same problem. In my case it happened when loading partitioned Parquet files, and I think that some partitions had few enough unique values for the category codes to fit in int8 (roughly fewer than 128) while others had more. This then failed inside to_parquet with:

    dask.multiprocessing.ValueError: Failed to convert partition to expected pyarrow schema:
        `ArrowInvalid('Integer value 128 not in range: -128 to 127', 'Conversion failed for column device_id with type category')`

    Expected partition schema:
        device_id: dictionary<values=string, indices=int8, ordered=0>
        ...
    Received partition schema:
        device_id: dictionary<values=string, indices=int16, ordered=0>
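
I believe the underlying reason is that pandas picks the smallest integer dtype that fits the number of categories when it creates the category codes, so partitions with different numbers of unique values end up with different dictionary index types once converted to pyarrow. A quick illustration (made-up values, just to show the dtype difference):

import pandas as pd

# Few categories -> int8 codes; many categories -> wider codes.
small = pd.Series(["a", "b", "c"], dtype="category")
large = pd.Series([str(i) for i in range(500)], dtype="category")
print(small.cat.codes.dtype)   # int8
print(large.cat.codes.dtype)   # int16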

I resolved this by manually specifying the schema parameter of to_parquet(), so that the affected columns use a larger index size for the pyarrow dictionaries.

import pyarrow

schema = {
    'device_id': pyarrow.dictionary(pyarrow.int32(), pyarrow.string()),
    'date': pyarrow.dictionary(pyarrow.int32(), pyarrow.string()),
}
to_parquet(...., schema=schema)

Hope this helps someone, as I could not find much else on this particular error online :slight_smile: