Hello everyone
I am using the SLURMCluster from dask-jobqueue to manage scheduling on an HPC cluster.
Right now I am trying to evaluate how well my code scales with the compute power provided by the infrastructure. I can do this by measuring runtimes for different numbers of workers, switching between them with the scale method of the cluster object.
However, since the underlying SLURM workload manager can take some time to start a job, my runtime measurements might initially run on fewer workers than intended. So I would like to have a scale_and_wait function that waits for the workers to start up (or shut down), so that I can guarantee that my computation runs on a specific number of workers.
Is there any functionality built into the dask framework that allows me to wait for such worker startups (and shutdowns)?
I suppose I could run a busy loop and check the number of workers registered with the scheduler on every iteration, but that would obviously be a very dirty solution.
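To make the polling idea concrete, here is a minimal sketch of what I have in mind. It is not a built-in dask feature: scale_and_wait, count_workers, timeout and poll_interval are names I made up for illustration. The only thing taken from the real API is the scale method of the cluster object; count_workers is assumed to be some callable that asks the scheduler how many workers are currently registered.

```python
import time


def scale_and_wait(cluster, count_workers, n, timeout=600.0, poll_interval=1.0):
    """Request n workers, then block until exactly n are registered.

    cluster       -- any object with a scale(n) method (e.g. a SLURMCluster)
    count_workers -- hypothetical zero-argument callable returning the number
                     of workers currently registered with the scheduler
    timeout       -- seconds to wait before giving up (assumed default)
    poll_interval -- seconds to sleep between checks (assumed default)
    """
    cluster.scale(n)
    deadline = time.monotonic() + timeout
    while True:
        current = count_workers()
        # Equality covers both scaling up and scaling down.
        if current == n:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"expected {n} workers, have {current} after {timeout} s"
            )
        # Sleep between checks so the loop does not spin at full speed.
        time.sleep(poll_interval)
```

I would then call scale_and_wait before each timed run and only start the measurement once it returns. It works, but it still feels like a workaround, hence the question.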