General cause/scenarios for `worker-handle-scheduler-connection-broken` error

Hey there! :wave:

I’ve recently been experiencing the “worker-handle-scheduler-connection-broken” error while using Dask.distributed, and I’m wondering what could be causing it. Can anyone help me identify scenarios that could lead to this error message? I’d like to understand the likely root causes based on those scenarios.

Thanks in advance for your assistance! :pray:

Hi @viniciusdc, welcome to Dask community!

I have to admit I’m not sure which error message you’re talking about. Could you please post the complete stack trace of the error? Some details of your workflow, or of how the error happened, may also be useful.

Hey @guillaumeeb, thanks for following up! I haven’t had much time to get back to this workflow yet and recheck the logs, but I will probably add them by this Friday. In the meantime, I can give you some context on what it looks like.

  • We are running some simulations on Dask, and one of the jobs relies on a third-party library with memory leaks that we can’t reduce. That was affecting the workers’ execution to the point where the heartbeats became a problem (timing out). In an attempt to work around this, I overrode the heartbeat settings and disabled them (I found a config setting for that somewhere).
  • It was working well so far, but during a long-running computation (~12h) we noticed a weird situation:
    • The previous job, which had ~40 workers, had succeeded, and Gateway terminated its scheduler as usual, so AWS started tearing down some of the nodes since they were no longer allocated to any pod. All normal so far. During that process, though, another scheduler that was still working (executing the long simulation) was terminated for no clear reason (no memory issues for that scheduler on the k8s side, at least no eviction warning). Then the workers started being killed by the Nanny process with the error message above.
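For context, the config override I mentioned looked roughly like the following. This is a sketch, and I’m assuming the relevant knob is the scheduler’s `worker-ttl` (the scheduler removes workers whose last heartbeat is older than this; setting it to `null` disables the check):

```yaml
# ~/.config/dask/distributed.yaml — illustrative sketch, not the exact config we used
distributed:
  scheduler:
    worker-ttl: null   # don't remove workers on missed heartbeats
```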

Based on the exception handling around the worker connection (in the source code), this seems to happen when the async connection object is terminated abruptly (my assumption), which does make sense, as the scheduler was terminated (not sure why yet). Still, my original question was more about understanding other possible scenarios I might have overlooked that could cause this error to appear, as I am uncertain whether the message has any correlation with what happened to the scheduler or was just collateral damage.

It sounds normal that the Workers are being shut down if they have no Scheduler left to talk to; there is a death-timeout config parameter for that. In any case, workers without a Scheduler would be useless. I can’t tell why the Scheduler was terminated, though.
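If it helps, that death-timeout can be set through the Dask config. A minimal sketch (the value here is illustrative, not a recommendation):

```yaml
# distributed.yaml — illustrative value
distributed:
  worker:
    death-timeout: 60s   # worker shuts itself down after 60s without a reachable scheduler
```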

I’m afraid I cannot answer this question. Maybe it would help if you could share the complete stack trace?