Scheduling issues on SLURMCluster for transforming large arrays

Hi,

I’m trying to rechunk a large-ish dataset (a bit less than 20TB) and would like to use a SLURMCluster for this task.
I tried both rechunker and xarray.open_zarr(..., chunks={...}).to_zarr(..., encoding={x: {"chunks": (...)}}). In both cases, I ended up with dask chunks of 256 MB or less, and I’ve reserved about 2 GB per dask worker, so I’d expect this to run fine. For the xarray case, I also ensured that each output chunk depends on a single input chunk only, so there are no complex inter-task dependencies.
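For reference, the xarray variant looks roughly like this (store paths, dimension/variable names, and chunk sizes are placeholders, not my real values):

```python
import xarray as xr

# Open the source store with the desired dask chunking on the read side
ds = xr.open_zarr("source.zarr", chunks={"time": 1, "y": 4096, "x": 4096})

# Write with a different on-disk chunking; each output chunk is carved out
# of exactly one input chunk, so there are no cross-chunk dependencies
ds.to_zarr(
    "target.zarr",
    encoding={"var": {"chunks": (1, 512, 512)}},
)
```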

When I run this operation on a LocalCluster, it does indeed work smoothly (I mostly see about 300 MB of memory usage per worker and never more than 1 GB).

When I run exactly the same operation on a SLURMCluster, it crashes relatively quickly. What I’ve observed is that in each case (xarray or rechunker), I mostly end up with two kinds of tasks: one for reading the data and another for writing the data (roughly 100k of each). On the LocalCluster, the corresponding write-task is usually scheduled immediately after its read-task has finished, which keeps the memory usage low. On the SLURMCluster, a great many read-tasks are scheduled upfront and it takes some time until the first write-tasks are scheduled. In many cases, a worker’s memory fills up completely before the first writes are scheduled, and then the worker gets killed or pauses indefinitely.
Write-tasks do get scheduled eventually, but I had to increase my memory reservation roughly 100-fold to have enough free space for the write-tasks to be scheduled. Even then, the workers sometimes still crash.

I can imagine that this may be because read-tasks are all ready while write-tasks are not, and it takes a while until write-tasks are sent over to the workers once they become ready. Is this (kind of) intended behaviour? Should it be different?

You may also want to ask at https://discourse.pangeo.io/

Hi! Thanks for posting this question. I’m not experienced at all with SLURMCluster, but I have a few guesses. Unfortunately, I don’t think there’s a ton that Dask itself can do here, but I’m happy to be proven wrong :smiley:

Mainly, I wonder if there is some difference between LocalCluster and SLURMCluster. Just to clarify, are you running the LocalCluster on your local machine, or on a single node of the cluster? If you’re running LocalCluster on your local machine, I wonder if resource differences between the machines cause the scheduler to behave differently.

Do you have the same number of workers between the LocalCluster case and the SLURMCluster case? If there are a lot more workers on the SLURM side, that may cause the scheduler to become a bottleneck in terms of scheduling tasks.

Anyhow, I don’t know that any of that helps, but I would second the suggestion to ask in a domain-specific forum, just in case someone there has experience with a similar dataset size. If you could post an MVCE here to help us start reproducing this, that’d be great. Obviously, sharing 20TB of data is tough, but maybe there’s a more minimal case that reproduces the problem?


As @raybellwaves mentions, there are some discussions on the Pangeo community about this:

And some issues in distributed, mainly:

Your case seems really simple, so I second the idea of providing an MVCE. Dask should definitely be more aware of its workers’ memory and not read data when a worker’s memory is full. It is a bit strange that the LocalCluster achieves this while big distributed ones do not.


Thanks for all the replies and the pointers to GitHub and Pangeo, I’ll dig through them in more detail.
I’ll also try to come up with a nice MVCE.
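Roughly, I’m picturing something like this as a starting point (sizes and the target path are made up, just to mirror the one-read-per-write task structure):

```python
import dask.array as da

# Synthetic stand-in for the real dataset: large input chunks that are
# split into smaller output chunks, so every write depends on one read only
src = da.random.random((200_000, 20_000), chunks=(10_000, 2_000))
dst = src.rechunk((1_000, 2_000))

# Writing to zarr produces the "one read task + one write task per chunk"
# pattern described above
dst.to_zarr("mvce_target.zarr", overwrite=True)
```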

@bryanweber I did run the LocalCluster on the same kind of machine as the workers within the SLURMCluster, and I configured each SLURM job the same way the LocalCluster was configured, so having two SLURM jobs gives twice the number of workers compared to the LocalCluster. This does create more load on the scheduler, but I had these problems even with only a single single-threaded worker per job and only 3–4 jobs running concurrently.
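To make that concrete, the minimal variant (one single-threaded worker per job, a few jobs) was set up roughly like this; queue name, walltime and the exact worker counts are placeholders:

```python
from dask.distributed import Client, LocalCluster
from dask_jobqueue import SLURMCluster

# Local case: a handful of single-threaded workers with ~2 GB each
local = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="2GB")

# SLURM case: each job runs one single-threaded worker with ~2 GB,
# and a few such jobs run concurrently
cluster = SLURMCluster(
    cores=1,
    processes=1,
    memory="2GB",
    walltime="01:00:00",
    queue="normal",
)
cluster.scale(jobs=4)

client = Client(cluster)
```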

An MVCE might be described here:


What I’ve observed is that in each case (xarray or rechunker), I mostly end up with two kinds of tasks: one for reading the data and another for writing the data

This sounds very much like the problem I was having. https://www.youtube.com/watch?v=ftlgOESINvo

See the recent changes in distributed that address this: Share your experiences with `worker-saturation` config to reduce memory usage · dask/distributed · Discussion #7128 · GitHub
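If someone wants to try it: with a recent distributed version the setting can be applied through dask’s config before the cluster is created, roughly like this (the value 1.1 is just the example from that discussion; check the defaults of your version):

```python
import dask
from dask.distributed import Client, LocalCluster

# Limit how many "root" (read) tasks the scheduler queues on each worker:
# roughly worker threads * worker-saturation tasks are in flight at once,
# instead of pushing out as many reads as possible up front.
dask.config.set({"distributed.scheduler.worker-saturation": 1.1})

# Set the config before the scheduler is created, i.e. before building the
# cluster (LocalCluster here as a stand-in; the same applies to SLURMCluster).
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)
```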