Workers constantly dying

nickvazz · July 18, 2023, 7:22pm

Hello,

I have been using Dask futures to run simulations across a cluster through subprocesses for a while now, but have been running into issues recently. On certain nodes (newly added local Windows machines), Dask seems to be killing workers consistently and I can’t figure out why. Each of the subprocesses is an EXE simulation that can take anywhere from 2 minutes to 2 hours. Some machines have zero issues, whereas the new machines all cause issues. In the past, when I did not have enough RAM on the machine, the subprocesses would fail and, it seemed, take the workers with them. Adding RAM helped, but changing the page file size seemed to make some difference as well (which I don’t understand). Additionally, I have made sure that all workers, schedulers, and clients are using the same versions of Python, libraries, and environment.

I have been wrapping the workers in a PowerShell loop to force the workers to restart:

while ($true) {dask worker scheduler:8786 --nworkers 50 --nthreads 1 --memory-limit '4GB' --name blade2 --death-timeout 120}

On this specific node, I have 512GB of RAM, so I do not expect to run into RAM issues.

I have also started monitoring the workers with Grafana/Prometheus, where you can see the workers dying and restarting.

Looking at the worker logs, these are the reasons for shutting down workers that I’ve come across:

‘nanny-close’,
‘nanny-close-gracefully’,
‘scheduler-remove-worker’,
‘worker-close’,
‘worker-handle-scheduler-connection-broken’

I found the thread “How to keep GIL-holding tasks from killing your workers?”, which could possibly be the issue, but I can’t tell.

Any ideas/help would be appreciated!

guillaumeeb · July 20, 2023, 11:24am

Hi @nickvazz, welcome to the Dask Discourse forum!

So based on your post, it seems that the problem comes only from the newly added Windows machines?

Are you able to execute a really simple task on them, like submitting a really simple computation through the Future API?

Do these machines manage to finish any of your tasks at all?

Are the other machines configured with the same amount of RAM per process?

And finally, do you have other information in the Worker or Scheduler logs? The reasons you gave are probably just the result of another problem: the worker process is stopping because something went wrong earlier.
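
The simple-task check suggested above could be sketched roughly as follows. This is only a minimal illustration, not an official Dask diagnostic: `whoami`, `smoke_test_workers`, and `run_against` are hypothetical helper names, and the sketch assumes `client` is a `distributed.Client` already connected to the scheduler. It pins one trivial task to each worker, so a node that cannot complete even that stands out.

```python
import socket


def whoami():
    """Trivial task: report the hostname of the machine that ran it."""
    return socket.gethostname()


def smoke_test_workers(client, timeout=30):
    """Pin one trivial task to every worker and record the outcome.

    `client` is assumed to be a distributed.Client.
    Returns {worker_address: hostname, or the exception that was raised}.
    """
    results = {}
    # Client.nthreads() returns a dict keyed by worker address.
    for address in client.nthreads():
        # pure=False stops Dask from reusing one cached result for all workers.
        future = client.submit(whoami, workers=[address], pure=False)
        try:
            results[address] = future.result(timeout=timeout)
        except Exception as exc:  # timeout, lost worker, comm error, ...
            results[address] = exc
    return results


def run_against(scheduler_address):
    """Hypothetical entry point, e.g. run_against("scheduler:8786")."""
    from distributed import Client  # imported lazily; needs dask.distributed

    with Client(scheduler_address) as client:
        return smoke_test_workers(client)
```

Workers whose entry is an exception rather than a hostname are the ones worth inspecting first, for example by reading their logs or checking their network rules.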

nickvazz · July 20, 2023, 1:47pm

I think I figured it out: the problem was the incoming port rules on the new worker nodes. Thank you for your response, though! I was about to update the thread. The answer to all your questions was yes: I could submit other job types, some would finish, and they all had the same RAM per process. I wonder if the documentation has a set of debugging steps somewhere that you can go through to verify that your port rules are set up correctly.

guillaumeeb · July 20, 2023, 2:12pm

Glad you fixed your problem!

I don’t think so. Could you be a bit more precise about what you would expect?

nickvazz · July 20, 2023, 2:28pm

I guess ideally there would be some “common deployment issues” stress test you could run, and if it leads to these types of logs appearing and this behavior, then you would know to check your port rules? I’m not sure, as I barely figured out the port rules thing myself.
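
As a rough idea of what such a port check might look like, here is a minimal sketch using only the Python standard library. It is not part of Dask; `can_connect` and `check_ports` are hypothetical helper names, and the host and port values in the usage note are placeholders. It simply tries to open a TCP connection to each port, which is what the scheduler and other workers need to be able to do.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, DNS failure, ...
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port to whether it accepted a connection from this machine."""
    return {port: can_connect(host, port, timeout) for port in ports}
```

For example, `check_ports("blade2", [9000, 9001])` run from the scheduler machine would show whether those worker ports accept incoming connections, and `can_connect("scheduler", 8786)` run from a worker node would test the opposite direction. Worker ports are random by default, so a check like this is mainly useful when they are pinned, which as far as I know can be done with the `--worker-port` and `--nanny-port` options of `dask worker`.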
