General cause/scenarios for `worker-handle-scheduler-connection-broken` error

Hey there! :wave:

I’ve recently been experiencing the “worker-handle-scheduler-connection-broken” error while using dask.distributed, and I’m wondering what could be causing it. Can anyone help me identify scenarios that could lead to this error message, so I can narrow down its root cause?

Thanks in advance for your assistance! :pray:

Hi @viniciusdc, welcome to the Dask community!

I have to admit I’m not sure which error message you’re talking about. Could you please post the complete stack trace of the error? Some details of your workflow, and of how the error happened, may also be useful.

Hey @guillaumeeb, thanks for following up! I haven’t had much time to work on this workflow again and recheck the logs, but I will probably add them by this Friday. In the meantime, I can give you some context on what it looks like.

  • We are running some simulations on Dask, and one of the jobs relies on a third-party library with memory leaks that we can’t reduce. That was affecting the workers’ execution to the point where the heartbeats became a problem (timing out). In an attempt to identify the cause, I overrode the config and disabled them (I found a config setting that does that somewhere; see the snippet after this list).
  • It was working well so far, but during a long-running computation (~12h) we noticed a weird situation:
    • The previous job, which had ~40 workers, had succeeded, and Gateway terminated its scheduler as usual, so AWS started tearing down some of the nodes since they were no longer allocated to any pod; all normal so far. During that process, though, another scheduler that was still working (executing the long simulation) was terminated, with no clear reason so far (no memory issues for the scheduler from k8s, at least no eviction warning). Then the workers started being killed by the Nanny process with the error message above.
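
For reference, this is a minimal sketch of the kind of override I mean; worker-ttl is the standard distributed setting that controls how long the scheduler tolerates a silent worker, though I’m not certain it is the exact key I changed:

```python
import dask

# Hedged sketch: relax (or disable) the scheduler's "worker is dead" check
# so a worker stuck in a leaky, long-running task is not dropped just
# because its heartbeats stall.
dask.config.set({
    # None disables the check entirely; a large value only makes it more
    # tolerant of late heartbeats.
    "distributed.scheduler.worker-ttl": None,
})
```

Since worker-ttl is read by the scheduler, in a Gateway setup it has to end up in the scheduler’s configuration, not only on the client side.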

Based on the exception catch from the worker connection (in the source code), this seems to happen when the async connection object is terminated abruptly (my assumption), which does make sense, as the scheduler was terminated (not sure why yet). Still, my original question was more about understanding other possible scenarios I might have overlooked that could cause this error to appear, as I am uncertain whether the message has any correlation with what happened to the scheduler or was just collateral damage.

It sounds normal that the Workers are shut down if they have no Scheduler left to talk to; there is a death-timeout config parameter for that. In any case, workers without a Scheduler would be useless. I can’t tell why the Scheduler was terminated, though.
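
For illustration, a minimal sketch of where that parameter lives; the scheduler address below is just a placeholder, and the same option exists on the CLI as dask worker --death-timeout:

```python
from dask.distributed import Worker

async def start_worker():
    # If the worker loses (or cannot reach) its scheduler for 60 seconds,
    # it shuts itself down instead of lingering forever.
    worker = await Worker("tls://scheduler:8786", death_timeout="60s")
    await worker.finished()
```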

I’m afraid I cannot answer this question. Maybe it would help if you could give the complete stack trace?

Hi @guillaumeeb, sorry for the delay, a lot of things going on and I lost track of time. Regarding a stack trace, I am not sure I can provide one, as I am not encountering an error per se; the execution of my task is suddenly stopped by the scheduler terminating (no OOM signal from the Kubernetes cluster).

This is the best I could get so far (a recent run that happened today):

... Some initial pre-computation goes in here...
2023-10-13 21:56:41,844 - distributed.core - INFO - Connection to tls://dask-e8b4d83f99974bde921ef1832be1dbb5.nebari:8786 has been closed.
INFO:distributed.core:Connection to tls://dask-e8b4d83f99974bde921ef1832be1dbb5.nebari:8786 has been closed.
2023-10-13 21:56:41,844 - distributed.worker - INFO - Stopping worker at tls://10.10.15.79:39489. Reason: worker-handle-scheduler-connection-broken
INFO:distributed.worker:Stopping worker at tls://10.10.15.79:39489. Reason: worker-handle-scheduler-connection-broken
2023-10-13 21:56:41,863 - distributed.nanny - INFO - Closing Nanny gracefully at 'tls://10.10.15.79:38669'. Reason: worker-handle-scheduler-connection-broken
Stream closed EOF for nebari/dask-worker-e8b4d83f99974bde921ef1832be1dbb5-z925q (dask-worker)

I couldn’t get the logs from the scheduler this time; we were doing maintenance on our instance of Loki.

No problem.

I’m afraid we cannot help much with that. We can only see that the Worker is terminating because it lost the Scheduler, which is normal.

Hi @guillaumeeb, I found the problem. Though to be fair, the root cause is still unknown to me. Just for some context before I detail the issue:

  • Our workflow used to run inside a custom context manager written around the Gateway, which was responsible for creating/scaling/terminating the scheduler instances (roughly sketched below).
  • While inside the context, we called the client, submitted tasks, and gathered results.
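
Roughly, the wrapper looked something like this; the names and details below are illustrative, not our actual code:

```python
from contextlib import contextmanager
from dask_gateway import Gateway

@contextmanager
def simulation_cluster(n_workers):
    gateway = Gateway()
    cluster = gateway.new_cluster()   # Gateway spawns a scheduler pod
    try:
        cluster.scale(n_workers)      # request worker pods
        with cluster.get_client() as client:
            yield client              # submit tasks / gather results here
    finally:
        cluster.shutdown()            # tears down the scheduler and workers
```

Any exception raised inside the with block, such as a broken client connection, ends up running the finally clause, so the whole cluster gets terminated.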

So, what was the issue?

  • One of the client.gather calls suddenly (this is the part I don’t know why it happens yet) sends a close connection to the scheduler (more specifically, it shows up on the scheduler side, but I couldn’t find the section of the code responsible for it). Then, due to the broken connection, Tornado raised an exception, which triggered the __exit__ method of the context class, terminating the cluster, which in turn, naturally, released the associated workers.

Those tasks usually take hours to complete, so my current suspicion is that something that keeps the connection with the client alive in the async routines is timing out (the snippet after this list shows where the relevant settings live). But that is as far as my knowledge goes for now…

  • It could be a heartbeat that does not return in time due to the Traefik layer (I don’t think it is, but it’s possible)
  • Because there is a login involved in connecting to the gateway pod, we pass along the security proxy config. This could be timing out, but would that only affect the client? Because the cluster object is fine.
  • Or a built-in timeout for the gather method specifically
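
These are the knobs I have been looking at while testing those hypotheses; they are standard distributed config keys, and raising them is just an experiment, not a confirmed fix:

```python
import dask

# Where the client-side timeout settings live.
print(dask.config.get("distributed.client.heartbeat"))       # client -> scheduler heartbeat interval
print(dask.config.get("distributed.comm.timeouts.connect"))  # time allowed to establish a connection
print(dask.config.get("distributed.comm.timeouts.tcp"))      # time before an unresponsive comm is considered dead

# Must be set before the Client is created to take effect on its comms.
dask.config.set({"distributed.comm.timeouts.tcp": "120s"})
```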

For the time being, I replaced all gather calls with as_completed, to avoid extended client connections as much as possible and to rely more on the Futures API (sketched below).
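
Concretely, the pattern now looks roughly like this (run_simulation and the parameters are placeholders for our actual workload):

```python
from dask.distributed import Client, as_completed

def run_simulation(p):                 # placeholder for the real task
    return {"param": p, "success": True}

client = Client()                      # in our case this comes from the Gateway cluster
futures = client.map(run_simulation, range(10))

results = []
for future in as_completed(futures):
    # Consume each result as soon as it is ready instead of blocking the
    # client in a single multi-hour gather() call.
    results.append(future.result())
```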

Any ideas why a client might just drop its connection like that? Both the scheduler and workers are unaffected, only the client.

I assume the client disconnection might be related to Traefik/Ingress routing, which could be introducing a slight delay in the health checks (of the connection?) after some extended event tracking. In any case, I could work around it by using an Event and saving the metadata (the results of the tasks) into a pickle on the filesystem: these tasks are almost all independent, the results are stored in a DB, and only some status/success flags are returned. So everything works now. Since my issue is solved, we can close this topic.
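
For completeness, the workaround looks roughly like this; names and paths are illustrative, not our exact code:

```python
import pickle
from dask.distributed import Event

def run_and_record(task_id):
    # Placeholder for the real work, which writes its results to the DB
    # and only produces small status/success flags.
    flags = {"task_id": task_id, "success": True}

    # Persist the small metadata to the shared filesystem instead of
    # returning it through a long-lived client connection.
    with open(f"/shared/results/{task_id}.pkl", "wb") as f:
        pickle.dump(flags, f)

    # Signal completion through a named distributed Event, which the
    # driver can wait on (or it can simply pick up the pickles later).
    Event(f"done-{task_id}").set()
    return True
```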

Thanks a lot @viniciusdc for the update here. And sorry we weren’t able to help more. Network problems are always hard to identify and understand…