Hello,
I have been using Dask futures to run simulations across a cluster through subprocesses for a while now, but recently I have been running into issues. On certain nodes (newly added local Windows machines), Dask seems to be killing workers consistently and I can't figure out why. Each subprocess is an EXE simulation that can take anywhere from 2 minutes to 2 hours. Some machines have zero issues, whereas the new machines all cause issues. In the past, when a machine did not have enough RAM, the subprocesses would fail and seemed to take the workers down with them. Adding RAM helped, but changing the page file size seemed to make some difference as well (which I don't understand). Additionally, I have made sure that all workers, the scheduler, and the clients are using the same versions of Python, libraries, and environment.
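For context, the submission side looks roughly like this (a simplified sketch, not my exact code; run_simulation, simulation.exe, and case_paths are placeholder names):

```python
from dask.distributed import Client
import subprocess

def run_simulation(case_path):
    # Each task launches the external EXE and blocks until it exits.
    # The executable name and argument layout are placeholders.
    completed = subprocess.run(
        ["simulation.exe", case_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.returncode

client = Client("scheduler:8786")

case_paths = ["case_001", "case_002"]  # placeholder list of simulation inputs
futures = [client.submit(run_simulation, p, pure=False) for p in case_paths]
results = client.gather(futures)
```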
I have been wrapping the workers in a PowerShell loop to force them to restart:
while ($true) {dask worker scheduler:8786 --nworkers 50 --nthreads 1 --memory-limit '4GB' --name blade2 --death-timeout 120}
On this specific node, I have 512 GB of RAM, so I do not expect to run into RAM issues.
I have also started monitoring the workers with Grafana/Prometheus, and you can see the workers dying and restarting.
Looking at the worker logs (which can also be pulled from the client, as in the sketch after the list), these are the close reasons I've come across:
'nanny-close',
'nanny-close-gracefully',
'scheduler-remove-worker',
'worker-close',
'worker-handle-scheduler-connection-broken'
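For reference, a rough sketch of pulling those logs from the client side (the exact return format may differ slightly from what the comments describe):

```python
from dask.distributed import Client

client = Client("scheduler:8786")

# Recent log lines from the workers, the nannies, and the scheduler;
# this is where the close reasons above show up.
worker_logs = client.get_worker_logs(n=200)             # dict: worker address -> [(level, message), ...]
nanny_logs = client.get_worker_logs(n=200, nanny=True)
scheduler_logs = client.get_scheduler_logs(n=200)

# Print anything that mentions a close, per worker.
for addr, lines in worker_logs.items():
    for level, message in lines:
        if "close" in message.lower():
            print(addr, level, message)
```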
I found the topic "How to keep GIL-holding tasks from killing your workers?", which could possibly be the issue, but I can't tell.
Any ideas/help would be appreciated!