Hi,
First of all, thank you for dask.
I’m currently migrating a code base from dask 2023.5.x to 2024.7.x, and I might have hit a bug in the optimizer. I’m not sure enough, though, to feel confident opening a GitHub issue.
I have a dask workflow which essentially does this:
dask.dataframe.read_parquet(
    ...,
    filters=[... some filters ...],
).map_partitions(
    ...
).to_parquet()
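To make the shape of the pipeline more concrete, here is a minimal sketch with a hypothetical path, filter, and per-partition transform (none of these are my real values):

    import dask.dataframe as dd

    # Hypothetical dataset location, filter column and transform,
    # only to illustrate the shape of the real pipeline.
    ddf = dd.read_parquet(
        "s3://bucket/dataset/",
        filters=[("event_date", "==", "2024-01-01")],
    )
    ddf = ddf.map_partitions(lambda pdf: pdf.assign(flag=True))
    ddf.to_parquet("s3://bucket/output/")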
When running this operation with the optimizer, it first tries to repartition the read_parquet output to a single partition, which in itself could make sense. However, in the process of repartitioning, I suspect it doesn’t apply the filters argument to the read_parquet call, so it loads all the data (which is too big to fit in RAM) and the workflow crashes.
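For what it is worth, the way I have been trying to observe this is by comparing the collection before and after dask.optimize. I am not sure this reflects exactly the plan that to_parquet ends up executing, so treat it as a rough sketch (the path and filter below are hypothetical):

    import dask
    import dask.dataframe as dd

    # Hypothetical path and filter, just to have something concrete to inspect.
    ddf = dd.read_parquet(
        "s3://bucket/dataset/",
        filters=[("event_date", "==", "2024-01-01")],
    ).map_partitions(lambda pdf: pdf)

    print("partitions before optimization:", ddf.npartitions)

    # dask.optimize returns optimized copies of the collections it is given.
    (opt,) = dask.optimize(ddf)
    print("partitions after optimization:", opt.npartitions)

    # Rendering the graph also shows which read tasks remain (requires graphviz).
    opt.visualize(filename="optimized_graph.svg")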
My current workaround is to use the to_legacy_dataframe method, like this:
dask.dataframe.read_parquet(
    ...,
    filters=[... some filters ...],
).to_legacy_dataframe().map_partitions(...).to_parquet()
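In case it is useful context: if I understand the docs correctly, the query planner can also be disabled globally instead of converting each collection back with to_legacy_dataframe, as long as the config is set before dask.dataframe is imported (please correct me if this is not the right knob):

    import dask

    # Must be set before dask.dataframe is imported,
    # otherwise the query planner is already active.
    dask.config.set({"dataframe.query-planning": False})

    import dask.dataframe as dd  # falls back to the legacy implementation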
I haven’t been able to pin down the root cause, though, so I thought I would share it here.
Thank you for your help.