Memory blow-up when dropping rows in xarray+dask using boolean masks

I’m working with a large xarray dataset backed by Dask (tens of millions of rows). I need to drop all positions where coverage is below a threshold across all samples, and I want to check I’m not going mad.

My first attempt was something like:

keep_pos = (ds["some_data_var"] >= 1).all(dim="sample_id")
ds_filtered = ds.where(keep_pos, drop=True)

…but this causes a massive memory spike and sometimes crashes. From what I can tell, the boolean indexing + drop=True forces Dask to materialize a huge boolean mask and copy data for each variable.

I feel like I’ve tried umpteen different ways of doing this now, but every way I slice it, it blows up. I can, of course, just mask with NaNs so that chunk sizes stay the same, but that upsets downstream calculations that might not be NaN-safe; at some point we want to actually drop the NaN rows.
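To illustrate what I mean by the NaN-masking variant, here it is in plain NumPy terms (a stand-in for `ds.where(keep_pos)` without `drop=True`; the array sizes and the `>= 1` threshold are made up for illustration):

```python
import numpy as np

# Plain-NumPy stand-in for ds.where(keep_pos) without drop=True:
# shape (and hence chunking) is preserved; dropped rows just become NaN.
# Sizes and the >= 1 threshold are illustrative only.
data = np.random.randint(0, 3, size=(1_000, 4)).astype(float)  # (pos, sample)
keep = (data >= 1).all(axis=1)                 # True where every sample passes
masked = np.where(keep[:, None], data, np.nan)
assert masked.shape == data.shape              # nothing is actually dropped
```

This is what keeps chunk sizes predictable, but it is exactly the output that the NaN-unsafe downstream steps choke on.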

Is there a memory-safe way to drop rows (and coords) without materializing everything in memory at once? Ideally I’d like to keep things lazy, or is my only option to somehow stream this to disk in batches?
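In case it helps frame the question, the “stream to disk in batches” fallback I have in mind looks roughly like this (pure-NumPy sketch; the file name, chunk size, and threshold are all placeholders, not my real pipeline):

```python
import os
import tempfile
import numpy as np

# Sketch: stream filtered rows to disk one chunk at a time, so peak memory
# stays around one chunk. File format and all sizes are illustrative only.
data = np.random.randint(0, 3, size=(1_000, 4))  # (pos, sample)
chunk = 100
out_path = os.path.join(tempfile.mkdtemp(), "filtered.bin")

n_rows = 0
with open(out_path, "ab") as f:
    for start in range(0, data.shape[0], chunk):
        block = data[start:start + chunk]
        block = block[(block >= 1).all(axis=1)]  # keep passing rows only
        block.tofile(f)                          # append survivors to disk
        n_rows += block.shape[0]

filtered = np.fromfile(out_path, dtype=data.dtype).reshape(n_rows, 4)
```

It works, but it feels like I’m reimplementing something Dask should be able to do for me.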

Thanks folks

Hi @GaryFrewin, welcome to Dask discourse!

First, it would really help to have a complete minimal reproducer of this issue.

Second, it would also help if you could reproduce the behaviour with Dask Array only, without Xarray. If that’s not possible, I would recommend also posting this topic on the Pangeo Discourse.

A few questions or suggestions:

  • If you don’t use drop=True, does the boolean indexing work without a memory spike?
  • If not, could it be that your boolean index does not have the same dimensions as the initial Dataset?
  • Did you try applying this filtering chunk by chunk (with appropriate chunking), since it seems to depend on only one dimension?
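To illustrate the last point, chunk-wise filtering can be sketched like this, with plain NumPy standing in for what Dask would do per block (sizes and the threshold are placeholders):

```python
import numpy as np

# Per-chunk filtering along the position axis: each iteration touches only
# one chunk, so peak memory stays around one chunk plus the surviving rows.
# (NumPy stand-in for a per-block Dask computation; numbers are illustrative.)
data = np.random.randint(0, 3, size=(1_000, 4))   # (pos, sample)
chunk = 100

kept = []
for start in range(0, data.shape[0], chunk):
    block = data[start:start + chunk]
    mask = (block >= 1).all(axis=1)   # mask computed per chunk
    kept.append(block[mask])          # only surviving rows are copied

filtered = np.concatenate(kept)
```

Since your mask depends on only the position dimension, each chunk can be filtered independently; the resulting chunk sizes become irregular, which is expected.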