Say you wanted a distributed Dask cluster with a total of 100 CPU cores, 1 TB memory, and 10 GPUs.
What’s the best way to set up the cluster worker nodes?
Should the worker nodes be uniform, i.e. the same number of workers/threads, the same CPU, memory, and GPU counts, etc.? Or does it not matter?
Is it best to have each worker node run a single worker on a single thread (1 worker per node), or does it not matter? In other words, does running multiple threads on a worker node create overhead that could be avoided by simply using more worker nodes to provide the same number of workers?
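To make the comparison concrete, here is a single-node sketch of the two layouts I have in mind, using `LocalCluster` (the sizes are placeholders for one 10-core, 100 GB node):

```python
from dask.distributed import Client, LocalCluster

# Layout (a): one worker process with many threads.
# Note: `memory_limit` is per worker process.
cluster_a = LocalCluster(n_workers=1, threads_per_worker=10, memory_limit="100GB")

# Layout (b): many single-threaded worker processes.
# 10 workers x 10 GB each = the same 100 GB node total.
cluster_b = LocalCluster(n_workers=10, threads_per_worker=1, memory_limit="10GB")

# Either way the scheduler sees 10 threads' worth of task slots.
client = Client(cluster_a)
print(client.scheduler_info()["workers"].keys())
```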
Because GPUs can’t easily be shared between workers, will the requirement of 10 GPUs limit me to a maximum of 10 worker threads running on the cluster?
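My current mental model (which may be wrong) is that each GPU is exposed to the scheduler as a worker resource, so only workers started with that resource can run GPU tasks. A minimal sketch of what I mean, assuming the GPU workers were launched with `--resources "GPU=1"` and using a placeholder scheduler address:

```python
from dask.distributed import Client

def gpu_task(x):
    # Placeholder for real GPU work (e.g. a CuPy/cuDF computation).
    return x * 2

# Placeholder scheduler address.
client = Client("tcp://scheduler-host:8786")

# Only workers that advertise a "GPU" resource are eligible to run this task;
# any extra workers on the same node without that resource would sit it out.
future = client.submit(gpu_task, 21, resources={"GPU": 1})
print(future.result())
```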
Ex:
- 10 worker nodes - Each node with a single worker running on a single thread, 10 CPU cores, 100 GB memory, and 1 GPU (10 workers, single-threaded, uniform; see the sketch after this list)
- 5 worker nodes - Each node with 2 workers running on 2 threads, sharing 20 CPU cores, 200 GB memory, and 2 GPUs (10 workers, multi-threaded, uniform)
- 2 worker nodes - One node with 6 workers running on 6 threads, sharing 60 CPU cores, 600 GB memory, and 6 GPUs, and the other node with 4 workers running on 4 threads, sharing 40 CPU cores, 400 GB memory, and 4 GPUs (10 workers, multi-threaded, non-uniform)
- 10 worker nodes - Each node with 2 workers running on 2 threads, sharing 10 CPU cores, 100 GB memory, and 1 GPU (20 workers, multi-threaded, uniform, but workers > GPUs, so 10 of the workers don’t have access to a GPU on their node)
- etc.
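For reference, the first layout above is roughly what I would try to express with `SSHCluster` (hostnames are placeholders, and I’m not sure this is the right way to attach the per-node GPU):

```python
from dask.distributed import SSHCluster, Client

# Placeholder hostnames: the first entry runs the scheduler, the rest run workers.
hosts = ["scheduler-host"] + [f"worker-{i:02d}" for i in range(10)]

cluster = SSHCluster(
    hosts,
    connect_options={"known_hosts": None},
    worker_options={
        "nthreads": 1,            # one single-threaded worker per node
        "memory_limit": "100GB",  # 100 GB per node
    },
    scheduler_options={"port": 8786, "dashboard_address": ":8787"},
    # For the GPU side, I believe dask-cuda offers worker_class="dask_cuda.CUDAWorker",
    # which starts one worker per visible GPU on each host, but I haven't tried it.
)
client = Client(cluster)
```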
In addition, does the Dask scheduler need to have a GPU? Is there a recommended amount of resources I should allocate to the scheduler relative to the size of the cluster?