When reading the bytes data (audio stored as bytes), I get this error:
ValueError: An error occurred while calling the read_parquet method registered to the pandas backend.
Original Message: ArrowStringArray requires a PyArrow (chunked) array of large_string type
even though I already have this set:
from dask import config as dask_config
...
dask_config.set({"dataframe.convert-string": False})
I tried setting this globally in the class's __init__() (but that class spawns worker processes via futures), and I also tried setting it right before each parquet read/write operation, as sketched below.
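Concretely, the two placements I tried look roughly like this (simplified; the real calls live inside that class):

from dask import config as dask_config
import dask.dataframe as dd

# 1) Once, globally, in the class's __init__()
dask_config.set({"dataframe.convert-string": False})

# 2) Scoped around each read/write, in case the global setting
#    does not reach the processes spawned via futures
with dask_config.set({"dataframe.convert-string": False}):
    ddf = dd.read_parquet("/path/to/data/clips.parquet", dtype_backend="pyarrow")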
DBeaver with DuckDB can read the column as a blob, and reading the same data directly with pandas works without any problem.
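For comparison, the plain pandas read that succeeds is essentially the following (the column selection is just illustrative):

import pandas as pd

# Reading the same dataset root directly with pandas/pyarrow works fine
pdf = pd.read_parquet(
    "/path/to/data/clips.parquet",
    columns=["clip_id"],
    dtype_backend="pyarrow",
)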
import dask.dataframe as dd
from dask.dataframe import DataFrame as DaskDataFrame

# p, dsinfo and schema are defined elsewhere in the class
ddf: DaskDataFrame = dd.read_parquet(  # type: ignore
    path=p,
    columns=["clip_id"],
    filters=[
        ("ds", "==", dsinfo.ds),
        ("lc", "==", dsinfo.lc),
    ],
    categories=["ds", "lc"],
    index=False,
    schema=schema,
    dtype_backend="pyarrow",
    metadata_task_size=0,
)
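In case it matters, here is a minimal sketch of the kind of schema being passed (only clip_id, ds and lc are real column names from the call above; the bytes column name is a placeholder):

import pyarrow as pa

# Illustrative only: the audio payload is a binary column,
# the partition columns are plain strings
schema = pa.schema(
    [
        pa.field("clip_id", pa.string()),
        pa.field("audio_bytes", pa.large_binary()),  # placeholder name for the bytes column
        pa.field("ds", pa.string()),
        pa.field("lc", pa.string()),
    ]
)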
PS: In the read_parquet call above, p points to the root of the parquet dataset, /path/to/data/clips.parquet. I emphasize this because I saw SO posts that work around this error by globbing the individual .parquet files instead of reading the dataset root; a sketch follows.
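For completeness, that glob-based workaround looks roughly like this; I would prefer to keep pointing read_parquet at the dataset root:

import glob
import dask.dataframe as dd

# Workaround from those posts: pass the individual part files
# instead of the dataset root
part_files = sorted(glob.glob("/path/to/data/clips.parquet/**/*.parquet", recursive=True))
ddf = dd.read_parquet(part_files, dtype_backend="pyarrow")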
- Any pointers on the cause and a possible solution?
- Why would Dask not use the provided schema, given that everything is still lazy at that point (before compute)?