Still cannot get rid of string conversion for blob

For the bytes column (audio stored as bytes), I get this error while reading:

ValueError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: ArrowStringArray requires a PyArrow (chunked) array of large_string type

although I already have this defined:

from dask import config as dask_config
...
dask_config.set({"dataframe.convert-string": False})

I tried setting this globally, then in the class’ __init__() (but that class spawns processes via futures), and also right before each parquet read/write operation.
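My working guess is that the setting may simply not reach the worker processes that the class spawns. A rough sketch of what I am trying next (the environment-variable name is my assumption of how Dask maps config keys to env vars, and the path is a placeholder):

    import os

    # Assumption: config set in the parent process may not propagate to child
    # worker processes, so export it as an environment variable before the
    # workers (and dask.dataframe) are created. The name follows Dask's usual
    # section.option -> DASK_SECTION__OPTION mapping.
    os.environ["DASK_DATAFRAME__CONVERT_STRING"] = "False"

    import dask
    import dask.dataframe as dd

    # Alternative: scope the setting around the read itself.
    with dask.config.set({"dataframe.convert-string": False}):
        ddf = dd.read_parquet("/path/to/data/clips.parquet", dtype_backend="pyarrow")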

DBeaver with DuckDB can read the column as a blob… Reading it directly with pandas also works without any problem.
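For reference, this is roughly what the direct pandas read looks like ("audio" here is a placeholder name for the actual bytes column in my schema):

    import pandas as pd

    # "audio" is a placeholder for the blob column; pandas reads it fine and
    # reports a pyarrow-backed binary dtype for it.
    df = pd.read_parquet(
        "/path/to/data/clips.parquet",
        columns=["clip_id", "audio"],
        dtype_backend="pyarrow",
    )
    print(df.dtypes)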

    ddf: DaskDataFrame = dd.read_parquet(  # type: ignore
        path=p,
        columns=["clip_id"],
        filters=[
            ("ds", "==", dsinfo.ds),
            ("lc", "==", dsinfo.lc),
        ],
        categories=["ds", "ls"],
        index=False,
        schema=schema,
        dtype_backend="pyarrow",
        metadata_task_size=0,
    )

PS: In the above code, p points to the root of the partitioned parquet dataset (/path/to/data/clips.parquet). I emphasize this because I have seen SO posts that work around this by globbing individual .parquet files.
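As a stop-gap I can bypass dd.read_parquet and build the Dask frame from a pyarrow read myself. A rough sketch of that fallback (not what I want long-term; the filter values "some_ds"/"some_lc" are placeholders):

    import pandas as pd
    import pyarrow.parquet as pq
    import dask.dataframe as dd

    # Read the partition with pyarrow directly, keep Arrow types in pandas,
    # then wrap the result in a Dask DataFrame. This side-steps Dask's own
    # string-conversion path entirely.
    table = pq.read_table(
        "/path/to/data/clips.parquet",
        columns=["clip_id"],
        filters=[("ds", "==", "some_ds"), ("lc", "==", "some_lc")],
    )
    pdf = table.to_pandas(types_mapper=pd.ArrowDtype)
    ddf = dd.from_pandas(pdf, npartitions=1)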

  1. Any pointers on the cause and a possible solution?
  2. Why would dask not use the provided schema, even though everything is still lazy at that point (before compute)?

Hi @bozden,

Would you be able to provide a minimal example with some fake or small data?

What does the schema kwarg contain?

Maybe also try disabling query planning?

dask.config.set({'dataframe.query-planning': False})
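If I remember correctly, this option only takes effect if it is set before dask.dataframe is imported, so something along these lines:

    import dask

    # Must run before the first `import dask.dataframe` so that the legacy
    # (non-expr) implementation is selected.
    dask.config.set({"dataframe.query-planning": False})

    import dask.dataframe as dd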

ccing @martindurant here also.

Sorry, I can’t help with this one. At some point, dask[-expr] assumes that anything that would previously have been “object” is a string.