Tuning Distributed Dask Clusters with GPUs

Say you wanted a distributed Dask cluster with a total of 100 CPU cores, 1TB memory, and 10 GPUs.
What’s the best way to set up the cluster worker nodes?

Should the worker nodes be uniform, i.e. same number of workers/threads, same CPU/memory/GPU, etc.? Or does it not matter?

Is it best to have each worker node running on a single thread (1 worker per node), or does it not matter? I.e. does running multiple threads on a worker node create overhead that could be avoided by simply using more worker nodes to provide the same number of workers?

Because GPUs can’t be easily shared between workers, will the requirement of 10 GPUs limit me to a maximum of 10 worker threads running on the cluster?

Ex:

  • 10 worker nodes - Each node with a single worker running on a single thread, 10 CPU cores, 100GB memory, and 1 GPU (10 workers, single-threaded, uniform)
  • 5 worker nodes - Each node with 2 workers running on 2 threads, sharing 20 CPU cores, 200GB memory, and 2 GPUs (10 workers, multi-threaded, uniform)
  • 2 worker nodes - One worker node with 6 workers running on 6 threads, sharing 60 CPU cores, 600GB memory, and 6 GPUs, and the other worker node with 4 workers running on 4 threads, sharing 40 CPU cores, 400GB memory, and 4 GPUs (10 workers, multi-threaded, non-uniform)
  • 10 worker nodes - Each node with 2 workers running on 2 threads, sharing 10 CPU cores, 100GB memory, and 1 GPU (20 workers, multi-threaded, uniform, but workers > GPUs so there are 10 workers that don’t have access to the single GPU on the node)
  • etc.

In addition, does the Dask scheduler need to have GPUs? Is there a recommended amount of resources I should allocate to the scheduler with respect to the size of the cluster?

Hi @txu0393, thanks for posting this question. Maybe @jacobtomlinson and/or @quasiben can chip in here and provide some insight.


Hello there @txu0393. Appreciate getting to chat with you and your team earlier today.

If I understand correctly, in addition to best practices around GPU worker configurations/resources, you are also interested in knowing how to configure GPU clusters using your existing dask-gateway installation with Kubernetes. You also mentioned that you are using a GPU operator (e.g., something like NVIDIA/gpu-operator on GitHub?).

The most relevant information I could find related to using GPU workers in a Dask cluster on Kubernetes (e.g., with dask-gateway) should help you in terms of configuring the Kubernetes resources, Docker images, and additional worker options that are needed to get GPU workers. Other folks closer to dask-gateway might be able to provide more specific steps or additional documentation/resources.
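
As a very rough sketch of what that looks like on the gateway side (the trait names come from the dask-gateway Kubernetes backend and the image is a placeholder, so please double-check both against the version you have installed), you would point the cluster config at a CUDA-enabled image and request a GPU for each worker pod:

```python
# dask_gateway_config.py on the gateway server (illustrative values only)

# A Docker image with the CUDA toolkit and your GPU libraries installed
c.KubeClusterConfig.image = "example.com/my-dask-gpu-image:latest"  # placeholder image

# Ask Kubernetes to place each worker container on a node with a GPU
c.KubeClusterConfig.worker_extra_container_config = {
    "resources": {"limits": {"nvidia.com/gpu": 1}}
}
```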


Apologies for the delay in responding here. Let me answer some of your questions directly.

Should the worker nodes be uniform?

Worker hardware can be different, but we generally recommend that the software environment (most likely a conda environment) be uniform. However, you need to consider that by default any task can run on any worker, so your least powerful worker sets the minimum bar for running each task.

You can intentionally restrict tasks to run on specific workers using resource annotations. For instance, if you have tasks that need a lot of memory or a GPU, annotations can be great here.
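
Here is a minimal sketch of that pattern (the scheduler address, the `process_on_gpu` function, and the data are placeholders): workers advertise an abstract resource when they start, and tasks that request it will only be scheduled on those workers.

```python
# Start a worker that advertises one "GPU" resource (run this on each GPU node):
#   dask-worker scheduler-address:8786 --resources "GPU=1"

import dask
from dask.distributed import Client

client = Client("scheduler-address:8786")  # placeholder scheduler address

# Per-task: only workers that declared a GPU resource will run this
future = client.submit(process_on_gpu, data, resources={"GPU": 1})

# Graph-level: annotate a block of delayed work with the same constraint
with dask.annotate(resources={"GPU": 1}):
    results = [dask.delayed(process_on_gpu)(part) for part in partitions]
dask.compute(*results)
```

Note that resources are just abstract labels the scheduler uses for placement; they don’t enforce anything at the hardware level.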

Is it best to have each worker node running on a single thread (1 worker per node), or does it not matter?

Workers can have 1-n threads and 1-n processes. It is generally best to ensure that threads × processes equals the number of CPU cores on the node. If you run dask-worker --nprocs=auto it will autodetect this and configure things with sensible defaults.
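
For example, for the 10-core nodes described above you might split each node into two worker processes with five threads each. This is just a sketch with placeholder addresses, and note that newer Dask versions spell the flag `--nworkers` rather than `--nprocs`:

```python
# On each node, from the command line (2 processes x 5 threads = 10 cores):
#   dask-worker scheduler-address:8786 --nprocs 2 --nthreads 5
# or let Dask pick a sensible split automatically:
#   dask-worker scheduler-address:8786 --nprocs auto

# The same idea on a single machine via the Python API:
from dask.distributed import LocalCluster, Client

cluster = LocalCluster(n_workers=2, threads_per_worker=5)
client = Client(cluster)
```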

Because GPUs can’t be easily shared between workers, will the requirement of 10 GPUs limit me to a maximum of 10 worker threads running on the cluster?

Not necessarily. You can have a mixture of GPU and non-GPU workers.
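
If you do want a strict one-worker-per-GPU layout on the GPU nodes, one common approach (assuming you can add the `dask-cuda` package to your worker images; this is not something your setup necessarily already uses) is to launch those workers with dask-cuda, which starts one worker process per visible GPU, while the CPU-only nodes keep running plain `dask-worker` processes:

```python
# On each GPU node, from the command line (one worker process per GPU):
#   dask-cuda-worker scheduler-address:8786

# Equivalent Python API for a single multi-GPU machine:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()  # one worker per detected GPU
client = Client(cluster)
```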

In addition, does the Dask scheduler need to have GPUs?

In theory no, but we generally recommend that it has one. This is because there may be edge-case bugs where some accidental GPU deserialization happens on the scheduler, which would cause the scheduler to die if it doesn’t have a GPU.

Is there a recommended amount of resources I should allocate to the scheduler with respect to the size of the cluster?

This really varies depending on the graphs you will be submitting to the scheduler: the more tasks, the more work the scheduler needs to do. I don’t have any concrete advice here, other than to say I usually make the scheduler equal to a worker in terms of hardware and have successfully scaled to over 1000 workers.
