Redundant scheduling of straggler tasks


Thanks for the great software. I’m currently using dask.distributed with dask-jobqueue on a SLURM cluster to do some large-scale, trivially parallel work. I’m experiencing an issue where, at any given time, a random subset of ≈1% of worker nodes is transiently slow or unresponsive. As a result, ≈1% of tasks take an inordinate amount of time and slow down the overall job loop. What do you think is a good way for me to address this straggler problem?

An idea:

  • I register a scheduler plugin that double- or triple-schedules tasks once the ratio of executing tasks to tasks in memory drops below 5%, or once fewer than 10 tasks are executing, and that properly handles letting any of the redundant tasks “win the race”.
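To make the trigger in that idea concrete, here is a minimal sketch of the rule as a plain function. It is only the decision logic, not a scheduler plugin: the function name, the parameter names, and the guards for empty counts are assumptions added for illustration, and how the two counts would be read from the scheduler is left open.

```python
def should_duplicate(n_executing, n_in_memory,
                     ratio_threshold=0.05, min_executing=10):
    """Decide whether the remaining tasks should be scheduled redundantly.

    n_executing: number of tasks currently running
    n_in_memory: number of tasks already finished (results in memory)
    """
    if n_executing == 0:
        # Assumption: with nothing running there is nothing to duplicate.
        return False
    if n_executing < min_executing:
        # Only a handful of tasks left: likely the straggler tail.
        return True
    if n_in_memory == 0:
        # Assumption: the ratio is undefined before any task has finished.
        return False
    return n_executing / n_in_memory < ratio_threshold
```

For example, 40 executing tasks against 1000 finished ones gives a ratio of 4%, which is under the 5% threshold, so the rule would fire.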

I’m also open to any other ideas you may have. Thank you!

Hi @robert-verkuil,

Welcome to Dask Discourse.

I would first look into the cause. Did you investigate why 1% of the worker nodes were slow? How many nodes do you use in your Dask cluster? Could there be some hardware issues?

If you can’t fix that, could you please share a code snippet showing how you are using Dask, so we can see whether there is an easier solution than a Scheduler plugin?

Apologies, I should have clarified: the random 1% slowness is due to the machines in the cluster, not Dask. I imagine this is unavoidable, since it’s a large shared cluster and I’m requesting a large number of machines. I observe that, due to other heavy workloads, random failures, etc., there is a random, transient subset of machines (out of hundreds) that are slow at any given time. The Dask workers on those machines then take far longer to process tasks than normal.

And it seems that, in this particular case, detecting and killing the slow workers is not a good solution for me.

Why is that? You might also be able to do this with some Worker plugin?

In any case, if you think you can implement the Scheduler plugin you talked about in your first post, then why not? Be careful of race conditions though.

Lastly, you could try to handle that on the client side, depending on your workflow.