Workers constantly dying

Hello,

I have been using Dask futures to run simulations across a cluster through subprocesses for a while now, but have recently been running into issues. On certain nodes (newly added local Windows machines), Dask seems to be killing workers consistently and I can’t figure out why. Each subprocess is an EXE simulation that can take anywhere from 2 minutes to 2 hours. Some machines have zero issues, whereas the new machines all cause issues. In the past, when a machine did not have enough RAM, the subprocesses would fail and seemed to take the workers with them. Adding RAM helped, but changing the page file size seemed to make some difference as well (which I don’t understand). Additionally, I have made sure that all workers, schedulers, and clients are using the same versions of Python, libraries, and environment.
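For context, the pattern is roughly the following sketch; the scheduler address, EXE path, and per-case arguments are placeholders for my actual setup:

```python
import subprocess
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

def run_simulation(exe_path, args):
    # Each task launches one external EXE and blocks until it exits.
    completed = subprocess.run([exe_path, *args], capture_output=True, text=True)
    completed.check_returncode()  # raise if the simulation exited with an error
    return completed.stdout

# EXE path and per-case arguments are placeholders for the real inputs.
futures = [
    client.submit(run_simulation, r"C:\sims\model.exe", [str(case)], pure=False)
    for case in range(100)
]
results = client.gather(futures)
```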

I have been wrapping the workers in a PowerShell loop to force them to restart:

while ($true) {dask worker scheduler:8786 --nworkers 50 --nthreads 1 --memory-limit '4GB' --name blade2 --death-timeout 120}

On this specific node I have 512GB of RAM, so I do not expect to run into RAM issues.

I have also started monitoring the workers with Grafana/Prometheus – you can see the workers dying and restarting.

Looking at the worker logs, these are the shutdown reasons I’ve come across:

'nanny-close',
'nanny-close-gracefully',
'scheduler-remove-worker',
'worker-close',
'worker-handle-scheduler-connection-broken'

I found "How to keep GIL-holding tasks from killing your workers?", which could possibly be the issue, but I can’t tell.

Any ideas/help would be appreciated! :slight_smile:

Hi @nickvazz, welcome to the Dask Discourse forum!

So based on your post, it seems that the problem comes only from the newly added Windows machines?

Are you able to execute a really simple task on them, like submitting a trivial computation through the Futures API?
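Something along these lines would do, with the scheduler address and the target machine’s hostname as placeholders:

```python
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

# Restrict the task to one of the new machines (replace "blade2" with its
# hostname or worker address) to confirm it can run anything at all.
future = client.submit(lambda x: x + 1, 1, workers=["blade2"])
print(future.result())  # should print 2 almost immediately
```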

Do these machines manage to finish any of your tasks at all?

Are the other machines configured with the same amount of RAM per process?

And finally, do you have any other information in the Worker or Scheduler logs? The reasons you gave are probably just the result of another problem: the worker process is stopping because something went wrong beforehand.

I think I figured it out: it was the incoming port rules on the new worker nodes. Thank you for your response though! I was about to update the thread. The answer to all your questions was yes: I could submit other job types, some would finish, and they all had the same RAM per process. I wonder if the documentation has a set of debugging steps somewhere that can be worked through to verify that your port rules are set up correctly.
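In case it helps anyone else, a minimal connectivity check along these lines would have caught my problem (hostnames and ports are placeholders; I believe pinning worker ports with --worker-port makes the inbound firewall rules predictable):

```python
import socket

def can_connect(host, port, timeout=3):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# The scheduler port must be reachable from every worker node, and each
# worker's listening port must be reachable from the scheduler and from
# the other workers.
print(can_connect("scheduler", 8786))  # scheduler comm port
print(can_connect("blade2", 60000))    # a worker port, e.g. pinned with --worker-port
```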

Glad you fixed your problem!

I don’t think so, could you be a bit more precise about what you would expect?

I guess ideally there would be some "common deployment issues" stress test you could run, and if it produced these kinds of logs and this behavior, the docs could point you to checking your port rules? I’m not sure, as I barely figured out the port rules thing myself :joy: