How to initialise a ssh cluster the right way and distribute tasks evenly

f-fritz · June 21, 2022, 4:32pm

Hey there

I have one single function that I want to run on a dask ssh cluster with multiple distinct function parameters. When submitting 20 tasks to a cluster of 15 machines, two of them will get 12 tasks and all other machines remain idle. I hope you can tell me whether I use it the wrong way - this behaviour can’t be right and costs me so much time:

cluster = SSHCluster(
	hosts = [ list of FQDNs ],
	connect_options = {
        "known_hosts": None,
        "username": MYUSERNAME,
    },
    worker_options = {
        "n_workers": 1},
    # https://docs.dask.org/en/latest/_modules/distributed/deploy/ssh.html#SSHCluster
    scheduler_options = {
        "port": 37153,
        "dashboard_address": ":8797"
    }
)

client = Client(cluster)

futures = []

for i in range(20):
    params = generate_params()
    futures.append(client.submit(my_function, params1))

seq = as_completed(futures)

for f in seq:
    res = f.result()
    process_result(res)
    f.cancel()
    del f
    if not stop_condition:
        seq.add(client.submit(my_function, generate_params()))

Finishing one iteration of my_function can take quite some time - depending on the parameters. Once one task is done, a new one might even be assigned to an idling machine. But overall, it takes super long until the tasks are at least somewhat distributed across the entire cluster. (The opposite happens as well: once the tasks are evenly distributed, individual machines might return back to idling while some others get queues being multiples of the other machine ones.)

my_function involves using the GPU so increasing the process or thread count per machine won’t do any benefit.

dask and distributed version 2022.5.2

ncclementi · June 23, 2022, 3:42pm

Hi @f-fritz, thank you for posting a question.

I’m not super familiarized with SSH cluster deployments however something I’m noticing is that you have n_workers=1 meaning that only one worker will be processing tasks? Maybe @jacobtomlinson can shine some light on the deployment aspect and how to set things to be coordinated with GPUs

gjoseph92 · June 23, 2022, 5:01pm

I’m guessing this is a similar problem to workload not balancing during scale up on dask-gateway · Issue #5599 · dask/distributed · GitHub. I doubt the fact that it’s SSH or GPUs are involved is relevant. Work balancing during scale-up is often poor (Use cases for work stealing · Issue #6600 · dask/distributed · GitHub).

f-fritz · June 26, 2022, 12:09pm

@ncclementi I am not quite sure, but I thought the n_workers parameter refers to the number of workers per machine and not per cluster. Each of my tasks utilises the entire GPU so starting another one on the same machine would fail.

f-fritz · June 26, 2022, 12:12pm

I see, thank you! I guess I looked for the wrong terms and read over the work-stealing entirely. I’ll check the documentation again

Topic		Replies	Views
Unbalanced utilisation of machines on SSHCluster (few with multiple tasks, few idling) Distributed	3	284	April 4, 2022
SSHCluster - Start a different number of workers on each host, programmatically Deploying Dask distributed	4	224	October 12, 2023
Dask SSH Cluster setup Distributed distributed	10	668	May 9, 2024
How to get each worker to process only one task when using SGECluster Distributed	11	671	September 27, 2023
One of the SSHCluster workers doesn't pick up tasks distributed	1	412	May 4, 2022

How to initialise a ssh cluster the right way and distribute tasks evenly

Related topics