Still cannot get rid of string conversion for blob

For the bytes column (audio stored as bytes), I get this error while reading:

ValueError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: ArrowStringArray requires a PyArrow (chunked) array of large_string type

although I already have this defined:

from dask import config as dask_config
...
dask_config.set({"dataframe.convert-string": False})

I tried setting this globally, then in the class’ __init__() (but that class spawns processes via futures), and also right before each parquet read/write operation.
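My working guess is that the setting may simply not reach the worker processes that the class spawns. A rough sketch of what I am trying next (the environment-variable name is my assumption of how Dask maps config keys to env vars, and the path is a placeholder):

    import os

    # Assumption: config set in the parent process may not propagate to child
    # worker processes, so export it as an environment variable before the
    # workers (and dask.dataframe) are created. The name follows Dask's usual
    # section.option -> DASK_SECTION__OPTION mapping.
    os.environ["DASK_DATAFRAME__CONVERT_STRING"] = "False"

    import dask
    import dask.dataframe as dd

    # Alternative: scope the setting around the read itself.
    with dask.config.set({"dataframe.convert-string": False}):
        ddf = dd.read_parquet("/path/to/data/clips.parquet", dtype_backend="pyarrow")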

DBeaver with DuckDB can read the column as a blob… Reading it directly with pandas also works without any problem.
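For reference, this is roughly what the direct pandas read looks like ("audio" here is a placeholder name for the actual bytes column in my schema):

    import pandas as pd

    # "audio" is a placeholder for the blob column; pandas reads it fine and
    # reports a pyarrow-backed binary dtype for it.
    df = pd.read_parquet(
        "/path/to/data/clips.parquet",
        columns=["clip_id", "audio"],
        dtype_backend="pyarrow",
    )
    print(df.dtypes)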

    ddf: DaskDataFrame = dd.read_parquet(  # type: ignore
        path=p,
        columns=["clip_id"],
        filters=[
            ("ds", "==", dsinfo.ds),
            ("lc", "==", dsinfo.lc),
        ],
        categories=["ds", "ls"],
        index=False,
        schema=schema,
        dtype_backend="pyarrow",
        metadata_task_size=0,
    )

PS: In the above code, p points to the root of the partitioned parquet dataset (/path/to/data/clips.parquet). I emphasize this because I have seen SO posts that work around this by globbing individual .parquet files.
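As a stop-gap I can bypass dd.read_parquet and build the Dask frame from a pyarrow read myself. A rough sketch of that fallback (not what I want long-term; the filter values "some_ds"/"some_lc" are placeholders):

    import pandas as pd
    import pyarrow.parquet as pq
    import dask.dataframe as dd

    # Read the partition with pyarrow directly, keep Arrow types in pandas,
    # then wrap the result in a Dask DataFrame. This side-steps Dask's own
    # string-conversion path entirely.
    table = pq.read_table(
        "/path/to/data/clips.parquet",
        columns=["clip_id"],
        filters=[("ds", "==", "some_ds"), ("lc", "==", "some_lc")],
    )
    pdf = table.to_pandas(types_mapper=pd.ArrowDtype)
    ddf = dd.from_pandas(pdf, npartitions=1)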

  1. Any pointers on the cause and a possible solution?
  2. Why would dask not use the provided schema, even though everything is still lazy at that point (before compute)?

Hi @bozden,

Would you be able to provide a minimal example with some fake or small data?

What does the schema kwarg contain?

Maybe also try disabling query planning?

dask.config.set({'dataframe.query-planning': False})
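If I remember correctly, this option only takes effect if it is set before dask.dataframe is imported, so something along these lines:

    import dask

    # Must run before the first `import dask.dataframe` so that the legacy
    # (non-expr) implementation is selected.
    dask.config.set({"dataframe.query-planning": False})

    import dask.dataframe as dd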

ccing @martindurant here also.

Sorry, I can’t help with this one. At some point, dask[-expr] assumes that anything that would previously have been “object” is a string.