How to initialise an SSH cluster the right way and distribute tasks evenly

Hey there

I have a single function that I want to run on a Dask SSHCluster with multiple distinct parameter sets. When I submit 20 tasks to a cluster of 15 machines, two of them get 12 of the tasks while all the other machines remain idle. I hope you can tell me whether I'm using it the wrong way – this behaviour can't be right and costs me a lot of time:

from dask.distributed import Client, SSHCluster, as_completed

cluster = SSHCluster(
    hosts=[...],  # list of FQDNs
    connect_options={
        "known_hosts": None,
        "username": MYUSERNAME,
    },
    worker_options={
        "n_workers": 1,
    },
    scheduler_options={
        "port": 37153,
        "dashboard_address": ":8797",
    },
)

client = Client(cluster)

futures = []

for i in range(20):
    params = generate_params()
    futures.append(client.submit(my_function, params))

seq = as_completed(futures)

for f in seq:
    res = f.result()
    del f
    if not stop_condition:
        seq.add(client.submit(my_function, generate_params()))

Finishing one iteration of my_function can take quite some time, depending on the parameters. Once a task is done, a new one might even be assigned to an idle machine, but overall it takes very long until the tasks are at least somewhat distributed across the entire cluster. (The opposite happens as well: once the tasks are evenly distributed, individual machines may return to idling while others accumulate queues several times longer than the rest.)

my_function uses the GPU, so increasing the process or thread count per machine won't bring any benefit.
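As a side note, since each task needs a whole GPU, one option (not from this thread; a sketch using Dask's abstract worker resources, where the `"GPU"` label is an arbitrary name I chose) is to have every worker advertise one `GPU` resource and tag each submission with it, so the scheduler never runs two GPU tasks on the same worker at once. The example below uses a LocalCluster for illustration; with SSHCluster you would pass `resources={"GPU": 1}` inside `worker_options`.

```python
from dask.distributed import Client, LocalCluster

def my_function(x):
    # stand-in for the real GPU-bound function
    return x * 2

if __name__ == "__main__":
    # Each worker advertises one abstract "GPU" resource; the scheduler
    # then runs at most one GPU-tagged task per worker at a time.
    cluster = LocalCluster(n_workers=2, threads_per_worker=1,
                           resources={"GPU": 1})
    client = Client(cluster)

    # Tag every submission as needing one "GPU" unit.
    futures = [client.submit(my_function, i, resources={"GPU": 1})
               for i in range(4)]
    results = client.gather(futures)
    print(results)

    client.close()
    cluster.close()
```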

dask and distributed version 2022.5.2

Hi @f-fritz, thank you for posting a question.

I'm not super familiar with SSH cluster deployments, but one thing I notice is that you have n_workers=1, meaning only one worker will be processing tasks? Maybe @jacobtomlinson can shed some light on the deployment aspect and how to configure things for GPUs.


I'm guessing this is a similar problem to workload not balancing during scale up on dask-gateway · Issue #5599 · dask/distributed · GitHub. I doubt the fact that SSH or GPUs are involved is relevant. Work balancing during scale-up is often poor (Use cases for work stealing · Issue #6600 · dask/distributed · GitHub).


@ncclementi I am not quite sure, but I thought the n_workers parameter refers to the number of workers per machine, not per cluster. Each of my tasks utilises the entire GPU, so starting another one on the same machine would fail.

I see, thank you! I guess I searched for the wrong terms and missed work stealing entirely. I'll check the documentation again :+1: