Loading a DataFrame with string[pyarrow] into Dask

I began converting a few columns to string[pyarrow] in pandas and then loading the DataFrame into a Dask DataFrame as usual. However, I started seeing this error:

2022-06-05 17:36:57,107 - distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniforge/base/envs/singularity/lib/python3.8/site-packages/distributed/protocol/core.py", line 159, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 194, in msgpack._cmsgpack.unpackb
  File "/opt/homebrew/Caskroom/miniforge/base/envs/singularity/lib/python3.8/site-packages/distributed/protocol/core.py", line 139, in _decode_default
    return merge_and_deserialize(
  File "/opt/homebrew/Caskroom/miniforge/base/envs/singularity/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 487, in merge_and_deserialize
    return deserialize(header, merged_frames, deserializers=deserializers)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/singularity/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 416, in deserialize
    return loads(header, frames)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/singularity/lib/python3.8/site-packages/distributed/protocol/serialize.py", line 95, in pickle_loads
    return pickle.loads(x, buffers=buffers)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/singularity/lib/python3.8/site-packages/distributed/protocol/pickle.py", line 64, in loads
    return pickle.loads(x, buffers=buffers)
  File "/opt/homebrew/Caskroom/miniforge/base/envs/singularity/lib/python3.8/site-packages/dask/dataframe/_pyarrow_compat.py", line 122, in rebuild_arrowstringarray
    [pyarrow_stringarray_from_parts(*parts) for parts in chunk_parts],
  File "/opt/homebrew/Caskroom/miniforge/base/envs/singularity/lib/python3.8/site-packages/dask/dataframe/_pyarrow_compat.py", line 122, in <listcomp>
    [pyarrow_stringarray_from_parts(*parts) for parts in chunk_parts],
  File "/opt/homebrew/Caskroom/miniforge/base/envs/singularity/lib/python3.8/site-packages/dask/dataframe/_pyarrow_compat.py", line 116, in pyarrow_stringarray_from_parts
    return pa.StringArray.from_buffers(nitems, data_offsets, data, mask, offset=offset)
  File "pyarrow/array.pxi", line 2092, in pyarrow.lib.StringArray.from_buffers
  File "pyarrow/array.pxi", line 981, in pyarrow.lib.Array.from_buffers
  File "pyarrow/array.pxi", line 1318, in pyarrow.lib.Array.validate
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Buffer #0 too small in array of type string and length 31383: expected at least 3924 byte(s), got 3923

This error happens when I call compute or persist. I am not sure what is causing this. I am using the latest versions of pyarrow and Dask.

Hi @nlhkha, can you tell us more about how to reproduce this error? Could you provide a minimal reproducible example? See Craft Minimal Bug Reports.

Also, can you provide the versions of dask, distributed, and pandas that you are using?

Thank you


Hi @ncclementi,
Yes, gladly.

import pandas as pd
from dask import dataframe as ddf
from distributed import Client, LocalCluster

cluster = LocalCluster(threads_per_worker=1)
client = Client(cluster)

data = pd.util.testing.makeCustomDataframe(nrows=1000, ncols=10).reset_index()
cols = ["R0"] + [f"C_l0_g{i}" for i in range(10)]
data[cols] = data[cols].astype("string[pyarrow]")
data = ddf.from_pandas(data, npartitions=128)
data = data.persist()
data.groupby(cols[:8]).apply(
    lambda d: pd.DataFrame([sum(d["C_l0_g9"].str.len())], columns=["length"]),
    meta={"length": "int32"},
).compute()

I expect the computation to complete successfully. Instead, I got this error:

pyarrow.lib.ArrowInvalid: Buffer #0 too small in array of type string and length 3: expected at least 2 byte(s), got 1

You can make this computation succeed by doing one of these:

  1. Reduce the number of rows to about 500.
  2. Do not create a distributed cluster.
  3. Do not convert the columns to string[pyarrow].
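As a sketch of workaround 3, the columns can be kept on pandas' Python-backed nullable string dtype instead of the pyarrow-backed one (the DataFrame and column name below are illustrative, not from the reproducer):

```python
import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob", None]})

# Use the Python-backed "string" dtype instead of "string[pyarrow]".
# This keeps nullable-string semantics but avoids the pyarrow buffer
# (de)serialization path that raises the ArrowInvalid error above.
df["name"] = df["name"].astype("string")

print(df["name"].dtype)  # string
```

The trade-off is higher memory use and slower string operations than the arrow-backed dtype, but it sidesteps the bug until a fixed Dask release is available.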

I am testing with:

  • Dask/Distributed: 2022.1.1, 2022.5.1, 2022.5.2
  • Pandas: 1.3.5

I am running on macOS with an Apple M1.

Here is my ticket: Pyarrow string bug · Issue #9163 · dask/dask · GitHub.


Thank you for opening the issue; it has now been fixed: Pyarrow string bug · Issue #9163 · dask/dask · GitHub :tada: