Memory blow-up when dropping rows in xarray+dask using boolean masks

I’m working with a large xarray dataset backed by Dask (tens of millions of rows). I need to drop all positions where coverage is below a threshold across all samples, and I want to check I’m not going mad.

My first attempt was something like:

keep_pos = (ds["some_data_var"] >= 1).all(dim="sample_id")
ds_filtered = ds.where(keep_pos, drop=True)

…but this causes a massive memory spike and sometimes crashes. From what I can tell, the boolean indexing + drop=True forces Dask to materialize a huge boolean mask and copy data for each variable.

I feel like I’ve tried umpteen different ways of doing this now, but every way I slice it, it blows up. I can, of course, just mask with NaNs so that chunk sizes stay the same, but that upsets downstream calculations that might not be NaN-safe; at some point we want to actually drop the NaN rows.
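To illustrate what I mean by the NaN-masking variant, here it is in plain NumPy terms (a stand-in for `ds.where(keep_pos)` without `drop=True`; the array sizes and the `>= 1` threshold are made up for illustration):

```python
import numpy as np

# Plain-NumPy stand-in for ds.where(keep_pos) without drop=True:
# shape (and hence chunking) is preserved; dropped rows just become NaN.
# Sizes and the >= 1 threshold are illustrative only.
data = np.random.randint(0, 3, size=(1_000, 4)).astype(float)  # (pos, sample)
keep = (data >= 1).all(axis=1)                 # True where every sample passes
masked = np.where(keep[:, None], data, np.nan)
assert masked.shape == data.shape              # nothing is actually dropped
```

This is what keeps chunk sizes predictable, but it is exactly the output that the NaN-unsafe downstream steps choke on.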

Is there a memory-safe way to drop rows (and coords) without materializing everything in memory at once? Ideally I’d like to keep things lazy, or is my only option to somehow stream this to disk in batches?
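In case it helps frame the question, the “stream to disk in batches” fallback I have in mind looks roughly like this (pure-NumPy sketch; the file name, chunk size, and threshold are all placeholders, not my real pipeline):

```python
import os
import tempfile
import numpy as np

# Sketch: stream filtered rows to disk one chunk at a time, so peak memory
# stays around one chunk. File format and all sizes are illustrative only.
data = np.random.randint(0, 3, size=(1_000, 4))  # (pos, sample)
chunk = 100
out_path = os.path.join(tempfile.mkdtemp(), "filtered.bin")

n_rows = 0
with open(out_path, "ab") as f:
    for start in range(0, data.shape[0], chunk):
        block = data[start:start + chunk]
        block = block[(block >= 1).all(axis=1)]  # keep passing rows only
        block.tofile(f)                          # append survivors to disk
        n_rows += block.shape[0]

filtered = np.fromfile(out_path, dtype=data.dtype).reshape(n_rows, 4)
```

It works, but it feels like I’m reimplementing something Dask should be able to do for me.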

Thanks folks

Hi @GaryFrewin, welcome to Dask discourse!

First, it would really help to have a complete minimal reproducer of this issue.

Second, it would also help if you could reproduce the behaviour with Dask Array only, without Xarray. If that’s not possible, I would recommend also posting this topic on the Pangeo Discourse.

A few questions or suggestions:

  • If you don’t use drop=True, does the boolean indexing work without a memory spike?
  • If not, could it be that your boolean index does not have the same dimensions as the initial Dataset?
  • Did you try applying this filtering chunk by chunk (with appropriate chunking), since it seems to depend on only one dimension?
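To illustrate the last point, chunk-wise filtering can be sketched like this, with plain NumPy standing in for what Dask would do per block (sizes and the threshold are placeholders):

```python
import numpy as np

# Per-chunk filtering along the position axis: each iteration touches only
# one chunk, so peak memory stays around one chunk plus the surviving rows.
# (NumPy stand-in for a per-block Dask computation; numbers are illustrative.)
data = np.random.randint(0, 3, size=(1_000, 4))   # (pos, sample)
chunk = 100

kept = []
for start in range(0, data.shape[0], chunk):
    block = data[start:start + chunk]
    mask = (block >= 1).all(axis=1)   # mask computed per chunk
    kept.append(block[mask])          # only surviving rows are copied

filtered = np.concatenate(kept)
```

Since your mask depends on only the position dimension, each chunk can be filtered independently; the resulting chunk sizes become irregular, which is expected.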