Hi,
First of all, thank you for dask.
I’m currently migrating a code base from dask 2023.5.x to 2024.7.x and I might have hit a bug with the optimizer. I’m not sure enough, though, so I didn’t feel confident opening a GitHub issue.
I have a dask workflow which essentially does this:
dask.dataframe.read_parquet(
    ...,
    filters=[... some filters ...],
).map_partitions(
    ...
).to_parquet()
When running this operation with the optimizer, it first tries to repartition the output of read_parquet into a single partition, which in itself could make sense. However, during that repartitioning I suspect it doesn’t apply the filters argument to the read_parquet call, so it loads all the data (which is too big to fit in RAM) and the workflow crashes.
My current workaround is to use the to_legacy_dataframe method this way:
dask.dataframe.read_parquet(
    ...,
    filters=[... some filters ...],
).to_legacy_dataframe().map_partitions(...).to_parquet()
I haven’t been able to identify the root cause, though, so I thought I would share it here.
Thank you for your help.