I am using an SSHCluster with 60 workers distributed across 8 nodes. The cluster is started and managed programmatically. This is a long-running simulation that is expected to take approximately 48 hours. The work is split across the available nodes and assigned to the remote workers via client.submit calls, with the worker specified explicitly for each submit call.
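For context, the cluster is created along these lines (a minimal sketch; the hostnames and options here are placeholders rather than my actual configuration):

from dask.distributed import Client, SSHCluster

# Placeholder hostnames; the first host runs the scheduler, the rest run workers
cluster = SSHCluster(
    ["scheduler-node", "node1", "node2", "node3"],
    connect_options={"known_hosts": None},
    worker_options={"nthreads": 1},
    scheduler_options={"dashboard_address": ":8787"},
)
client = Client(cluster)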
The following is a simplified example. param1, param2, and param3 represent the data structures I am passing; they are different for each submit call, but how they are generated is out of scope, which is why I have reduced them to param1-3.
from dask.distributed import as_completed

futures = []
for worker_index in range(dask_numtasks):
    params = (param1, param2, param3, ...)  # simplified; built differently for each task
    worker_url = worker_urls[worker_index]
    # Pin the task to a specific worker
    future = client.submit(worker_method, params, workers=worker_url)
    futures.append(future)

# Process results as soon as they are ready
for future in as_completed(futures):
    result = future.result()
    process_result(result)
The problem is that sometimes one worker gets dropped and the simulation gets stuck. I have seen this happen in real time a couple of times: looking at the list of workers in the Dashboard, one of the workers is highlighted with a light red background, with the “last seen” value increasing until it times out. When this happens and I try to access the worker logs, the page gets stuck loading and the logs are never displayed. The scheduler logs also don’t show any errors or anything that looks related. I have also checked the affected node for potential issues: all the worker processes are still running and visible in System Monitor, and there are no error logs whatsoever.
I have had cases where there were errors in the simulation logic, and these are all handled correctly and propagated back to the client to ensure proper logging. This seems to be a network (or Dask) issue or something similar. The fact that it doesn’t always happen at the same stage of the simulation only strengthens that suspicion. Could this be an overload of the network switch? Is there any other way to understand (or, ideally, know for sure) what might be happening?
I think the simulation gets stuck because as_completed keeps waiting for a future that never completes.
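When this happens, the only additional information I can think of gathering is from the client side, roughly along these lines (a sketch, assuming the client can still reach the scheduler):

from collections import Counter

# Count future states while the run appears stuck; a single long-lived
# 'pending' future would support the theory above.
print(Counter(f.status for f in futures))
# List the workers the scheduler still knows about, to see whether the
# dropped worker has disappeared from its point of view.
print(list(client.scheduler_info()["workers"].keys()))

Is that a reasonable way to confirm it, or is there better tooling for this?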
I have looked at the resilience page in the Dask Distributed documentation, but it does not go into detail about specific contexts (e.g. when consuming results via as_completed). The as_completed approach is particularly useful for my workflow, as it ensures that results are processed as soon as they are ready and returned to the client. How would one handle resilience in my case (i.e. continue gracefully with a missing worker)?
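For example, would loosening the worker pinning and adding retries be the right direction? Something like the following (a hypothetical variant of my submit call, which I have not validated):

# Hypothetical variant of the submit call above: keep the preferred worker,
# but let the scheduler move or rerun the task if that worker is lost.
future = client.submit(
    worker_method,
    params,
    workers=worker_url,        # preferred worker
    allow_other_workers=True,  # allow the task to run elsewhere if needed
    retries=2,                 # rerun automatically after a worker failure
)

Or is there a recommended pattern for detecting a lost future from within the as_completed loop and resubmitting it?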