Hey there
I have a single function that I want to run on a Dask SSH cluster with many distinct parameter sets. When I submit 20 tasks to a cluster of 15 machines, two of the machines receive 12 tasks while all the other machines remain idle. I hope you can tell me whether I'm using it the wrong way - this behaviour can't be right, and it costs me a lot of time:
from dask.distributed import Client, SSHCluster, as_completed

cluster = SSHCluster(
    hosts=[...],  # list of FQDNs
    connect_options={
        "known_hosts": None,
        "username": MYUSERNAME,
    },
    worker_options={
        "n_workers": 1,
    },
    # https://docs.dask.org/en/latest/_modules/distributed/deploy/ssh.html#SSHCluster
    scheduler_options={
        "port": 37153,
        "dashboard_address": ":8797",
    },
)
client = Client(cluster)

futures = []
for i in range(20):
    params = generate_params()
    futures.append(client.submit(my_function, params))

seq = as_completed(futures)
for f in seq:
    res = f.result()
    process_result(res)
    f.cancel()
    del f
    if not stop_condition:
        seq.add(client.submit(my_function, generate_params()))
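
To make this a bit more concrete, here is a minimal stand-in for my_function and generate_params (the real bodies run a GPU workload; the sleep and the random parameter below are placeholders I made up purely for illustration):

import random
import time

def generate_params():
    # placeholder: the real parameter sets are more complex
    return {"size": random.randint(1, 10)}

def my_function(params):
    # placeholder: the real function runs a GPU job whose duration
    # depends on the parameters; the sleep only mimics that
    time.sleep(params["size"])
    return params["size"] ** 2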
Finishing one iteration of my_function can take quite some time, depending on the parameters. Once a task is done, the newly submitted one might even be assigned to an idle machine, but overall it takes very long until the tasks are at least somewhat distributed across the entire cluster. (The opposite happens as well: once the tasks are evenly distributed, individual machines sometimes return to idling while others build up queues several times longer than those of the remaining machines.)
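
Besides watching the dashboard, I check the imbalance roughly like this (just a sketch using Client.processing(); the counts obviously vary from run to run):

# number of tasks currently assigned to each worker
for worker_address, task_keys in client.processing().items():
    print(worker_address, len(task_keys))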
my_function uses the GPU, so increasing the process or thread count per machine won't help.
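
In case it matters, this is the worker configuration I would expect to pin each machine to one task at a time (adding nthreads is my assumption about the relevant option, it is not part of my current setup):

worker_options = {
    "n_workers": 1,  # one worker process per machine
    "nthreads": 1,   # one concurrent task, since my_function occupies the GPU
}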
dask and distributed version 2022.5.2