I began converting a few columns to string[pyarrow] in pandas and then loading the DataFrame into a Dask DataFrame as usual.
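Roughly, the setup looks like this (a minimal sketch; the data and column names are placeholders, not my real frame):

    import pandas as pd
    import dask.dataframe as dd

    # Placeholder frame; the real one is much larger.
    df = pd.DataFrame({"name": ["alice", "bob", "carol"], "value": [1, 2, 3]})

    # Convert an object column to the Arrow-backed string dtype.
    df["name"] = df["name"].astype("string[pyarrow]")

    # Load the pandas DataFrame into a Dask DataFrame as usual.
    ddf = dd.from_pandas(df, npartitions=2)

However, I then began seeing this error: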
2022-06-05 17:36:57,107 - distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/singularity/lib/python3.8/site-packages/distributed/protocol/core.py", line 159, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 194, in msgpack._cmsgpack.unpackb
  File "/opt/homebrew/Caskroom/miniforge/base/envs/singularity/lib/python3.8/site-packages/distributed/protocol/core.py", line 139, in _decode_default
    return merge_and_deserialize(
  File "/opt/homebrew/Caskroom/miniforge/base/envs/singularity/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 487, in merge_and_deserialize
    return deserialize(header, merged_frames, deserializers=deserializers)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/singularity/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 416, in deserialize
    return loads(header, frames)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/singularity/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 95, in pickle_loads
    return pickle.loads(x, buffers=buffers)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/singularity/lib/python3.8/site-packages/distributed/protocol/pickle.py", line 64, in loads
    return pickle.loads(x, buffers=buffers)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/singularity/lib/python3.8/site-packages/dask/dataframe/_pyarrow_compat.py", line 122, in rebuild_arrowstringarray
    [pyarrow_stringarray_from_parts(*parts) for parts in chunk_parts],
  File "/opt/homebrew/Caskroom/miniforge/base/envs/singularity/lib/python3.8/site-packages/dask/dataframe/_pyarrow_compat.py", line 122, in <listcomp>
    [pyarrow_stringarray_from_parts(*parts) for parts in chunk_parts],
  File "/opt/homebrew/Caskroom/miniforge/base/envs/singularity/lib/python3.8/site-packages/dask/dataframe/_pyarrow_compat.py", line 116, in pyarrow_stringarray_from_parts
    return pa.StringArray.from_buffers(nitems, data_offsets, data, mask, offset=offset)
  File "pyarrow/array.pxi", line 2092, in pyarrow.lib.StringArray.from_buffers
  File "pyarrow/array.pxi", line 981, in pyarrow.lib.Array.from_buffers
  File "pyarrow/array.pxi", line 1318, in pyarrow.lib.Array.validate
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Buffer #0 too small in array of type string and length 31383: expected at least 3924 byte(s), got 3923
This error happens when I call compute or persist. I am not sure what is causing it. I am using the latest versions of pyarrow and Dask.
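For reference, the error surfaces once results travel through the distributed scheduler, roughly like this (a sketch; I use a default local Client):

    from dask.distributed import Client

    client = Client()  # local cluster; results are pickled between workers and the client

    ddf = ddf.persist()     # fails here...
    result = ddf.compute()  # ...or here, while deserializing the string[pyarrow] column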