I’m working with a large xarray dataset backed by Dask (tens of millions of rows). I need to drop all positions where coverage is below a threshold across all samples. I want to check I’m not going mad.
My first attempt was something like:

```python
keep_pos = (ds["some_data_var"] >= 1).all(dim="sample_id")
ds_filtered = ds.where(keep_pos, drop=True)
```
…but this causes a massive memory spike and sometimes crashes. As far as I can tell, the boolean indexing with `drop=True` forces Dask to materialize the boolean mask and then copy the data for every variable in the dataset.
I feel like I’ve tried umpteen different ways of doing this now, but every way I slice it, it blows up. I can, of course, just mask with NaNs so that the chunk sizes stay the same, but that upsets downstream calculations that might not be NaN-safe, and at some point we want to drop the NaN rows anyway.
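For reference, the mask-with-NaNs variant I mean is just `where` without `drop=True`. A minimal sketch on a toy dataset (variable and dimension names here are stand-ins for my real ones):

```python
import numpy as np
import xarray as xr

# Toy stand-in for the real dataset; "coverage", "position" and
# "sample_id" are hypothetical names for illustration only.
ds = xr.Dataset(
    {"coverage": (("position", "sample_id"),
                  np.array([[0, 2], [3, 4], [1, 0]]))},
    coords={"position": [10, 20, 30], "sample_id": ["a", "b"]},
).chunk({"position": 2})

keep_pos = (ds["coverage"] >= 1).all(dim="sample_id")

# No drop=True: chunk sizes are preserved and everything stays lazy,
# but low-coverage rows become NaN instead of disappearing.
ds_masked = ds.where(keep_pos)
```

This keeps memory flat, but the NaN rows are still there for downstream code to trip over.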
Is there a memory-safe way to drop rows (and coords) without materializing everything into memory at once? Ideally I’d like to keep things lazy if possible, or is my only option to somehow stream this to disk in batches?
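In case it helps, this is roughly what I had in mind for the batch-to-disk fallback: the mask itself is only 1-D over positions (one bool per row, so small even at tens of millions of rows), so compute it eagerly, then pull out surviving rows in fixed-size slices. All names here are hypothetical placeholders, not my real schema:

```python
import numpy as np
import xarray as xr

# Hypothetical toy dataset standing in for the real one.
ds = xr.Dataset(
    {"coverage": (("position", "sample_id"),
                  np.arange(12).reshape(6, 2) % 5)},
    coords={"position": np.arange(6)},
).chunk({"position": 3})

# The mask is 1-D over positions, so computing it eagerly is cheap.
keep = (ds["coverage"] >= 1).all(dim="sample_id").compute().values
keep_idx = np.flatnonzero(keep)

batch = 2  # tiny here; would be millions of rows in practice
parts = []
for start in range(0, len(keep_idx), batch):
    part = ds.isel(position=keep_idx[start:start + batch]).compute()
    # In practice, write each batch out instead of keeping it:
    # part.to_netcdf(f"filtered_part_{start}.nc")
    # or part.to_zarr(store, append_dim="position")
    parts.append(part)
```

Only one batch of filtered rows is ever resident at a time, but it obviously isn’t lazy end-to-end, which is why I’m hoping there’s something better.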
Thanks folks