When should you use Dask-MPI vs dask-jobqueue?

I am trying to set up Dask on our servers. I noticed that there seem to be two options for scaling to multiple nodes: Dask-MPI and dask-jobqueue. Is there any reason to use one over the other? Our server currently has MPI, but we may be able to set up the jobqueue option as well.

I wanted to clarify whether people simply use whichever one is supported on their system, or whether there is a reason to have both available for different use cases.

Thanks for the help!

Hi @zechiel and welcome! Dask-jobqueue is currently the simpler and recommended way to run Dask on an HPC cluster, and can also adapt the cluster size dynamically based on current load. It is slightly more oriented towards interactive use, however, so dask-mpi may be better suited for some batch production workloads (see here for more details).

Hope that helps!


Thanks for the reply!

What advantage does dask-mpi have for batch jobs? Does it run faster?

That’s a good question. My impression is that it’s less about one running faster than the other and more about which works better for your workflow. @guillaumeeb, I wonder if you might have some insight on this topic?

Hi, thanks for the ping @scharlottej13.

With Dask-jobqueue, each worker (or small group of workers, depending on your setup) runs in its own HPC job. This means you cannot guarantee that these jobs, and therefore the workers, will run at the same time. They might start only slightly apart, but they might also be scheduled at much larger intervals. This is fine if you don’t need all your workers at once, if you only want a few computing resources, or if you want adaptive scaling.
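
To make that concrete, here is a minimal sketch of the Dask-jobqueue approach. It assumes a SLURM scheduler (Dask-jobqueue has equivalent classes for other schedulers), and the queue name and resource values in `JOB_KWARGS` are placeholders, not recommendations for your site.

```python
# Placeholder resources for ONE batch job; adjust to your site's
# queue names and node sizes (these values are assumptions).
JOB_KWARGS = {
    "cores": 8,
    "memory": "16GB",
    "walltime": "01:00:00",
    "queue": "normal",
}


def start_adaptive_cluster(max_jobs=20):
    """Create a cluster whose workers live in independently queued jobs."""
    # Imported lazily so this file can be read without the libraries installed.
    from dask_jobqueue import SLURMCluster  # assumes a SLURM scheduler
    from dask.distributed import Client

    cluster = SLURMCluster(**JOB_KWARGS)
    # Each job is submitted separately, so workers join as their jobs
    # start; nothing guarantees that they all start together.
    cluster.adapt(minimum_jobs=0, maximum_jobs=max_jobs)
    return cluster, Client(cluster)
```

You would typically call `start_adaptive_cluster()` from a login node or a notebook. Calling `cluster.job_script()` on the returned cluster prints the batch script that gets submitted for each job.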

With Dask-mpi, you are guaranteed to have all the computing resources at the same time, for the walltime you asked for. This is because all of your workers run inside the same job, and so share the same resource allocation. That makes it easier to size your job appropriately for batch usage. However, you will probably have to wait much longer in the queue for all the resources you asked for to become available at the same time. This can be a problem for interactive usage, but not for batch processing that can run overnight.
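
For comparison, a minimal sketch of the Dask-mpi batch pattern. The `square` function and the workload are made up for illustration. The script is meant to be launched inside a single job allocation with something like `mpirun -np 4 python my_script.py`, where the script name and process count are placeholders.

```python
def square(x):
    """Toy task standing in for real work."""
    return x * x


def run_batch():
    """Run a whole workload inside one MPI job allocation."""
    # Imported lazily so this file can be read without the libraries installed.
    from dask_mpi import initialize
    from dask.distributed import Client

    # Rank 0 becomes the scheduler, rank 1 carries on with this script
    # as the client, and all remaining ranks become workers.
    initialize()
    client = Client()  # connects to the scheduler started by initialize()

    futures = client.map(square, range(100))
    return sum(client.gather(futures))
```

Because the scheduler, client, and workers are all ranks of the same MPI job, they start and stop together within the walltime of that one allocation.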


Perfect, thanks for the reply. This was the exact answer I was looking for.
