Scheduler keeps running when client disconnects

Hello,

We have been running dask on our k8s cluster, using dask gateway to create schedulers.
Everything is working well, but we have a small issue with schedulers running forever even after they lose the connection to the client.

It’s not entirely straightforward, so I will give two examples: a successful one and a failing one.
Note that all the schedulers are requested with the option shutdown_on_close=True:

if dask_scheduler:
    cluster = self.gateway.connect(dask_scheduler, shutdown_on_close=True)
else:
    cluster = self.gateway.new_cluster(cluster_options=options)

In both cases, let’s imagine a JupyterLab running in the same cluster: it connects to the gateway, requests a GatewayCluster, and then creates a client to the scheduler provisioned by the gateway.

Successful one
I can scale workers, read data, or do whatever else, as long as the process runs to completion. If I then stop that JupyterLab (or it dies for any reason), the client dies and the scheduler is shut down.
This is the desired behaviour: with no client attached to the scheduler, the identity handler call fails and the scheduler shuts down.
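As a rough illustration (a hedged sketch of the behaviour visible in the logs below, not dask-gateway's actual implementation; the function name and parameters are my own), the shutdown decision in the successful case behaves something like:

```python
def should_shutdown(n_clients: int, seconds_since_last_client: float,
                    grace_period: float = 5.0) -> bool:
    """Hypothetical sketch of the shutdown check: once the scheduler has
    no connected clients and a grace period for reconnection has elapsed,
    it is safe to shut the scheduler down (as seen in the successful logs,
    where SIGTERM arrives shortly after the last client is removed)."""
    return n_clients == 0 and seconds_since_last_client > grace_period


# In the successful case the check fires; in the failing case it never
# does, even though the client has been removed:
print(should_shutdown(0, 10.0))  # True  -> scheduler shuts down
print(should_shutdown(1, 10.0))  # False -> a client is still attached
```

The failing case below looks as if this check (or whatever the gateway actually does) simply never triggers after the client is removed.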

Logs from the scheduler:

2022-08-04 10:25:18,040 - distributed.core - DEBUG - Message from 'tls://172.16.1.74:36674': {'op': 'heartbeat_worker', 'address': 'tls://172.16.1.74:33231', 'now': 1659608718.0356054, 'metrics': {'executing': 0, 'in_memory': 0, 'ready': 0, 'in_flight': 0, 'bandwidth': {'total': 100000000, 'workers': {}, 'types': {}}, 'spilled_nbytes': {'memory': 0, 'disk': 0}, 'event_loop_interval': 0.019992456436157227, 'cpu': 2.0, 'memory': 149127168, 'time': 1659608718.034502, 'read_bytes': 329.7086762537488, 'write_bytes': 1136.995374475049, 'read_bytes_disk': 0.0, 'write_bytes_disk': 16500492.507137429, 'num_fds': 15}, 'executing': {}, 'extensions': {'shuffle': {}}, 'reply': True}
2022-08-04 10:25:18,040 - distributed.core - DEBUG - Calling into handler heartbeat_worker
2022-08-04 10:25:18,509 - distributed.core - DEBUG - Message from 'tls://172.16.0.27:41052': {'op': 'identity', 'reply': True}
2022-08-04 10:25:18,509 - distributed.core - DEBUG - Calling into handler identity
2022-08-04 10:25:18,540 - distributed.core - DEBUG - Message from 'tls://172.16.1.74:36674': {'op': 'heartbeat_worker', 'address': 'tls://172.16.1.74:33231', 'now': 1659608718.5363405, 'metrics': {'executing': 0, 'in_memory': 0, 'ready': 0, 'in_flight': 0, 'bandwidth': {'total': 100000000, 'workers': {}, 'types': {}}, 'spilled_nbytes': {'memory': 0, 'disk': 0}, 'event_loop_interval': 0.019992456436157227, 'cpu': 4.0, 'memory': 149127168, 'time': 1659608718.5353162, 'read_bytes': 329.46350199731785, 'write_bytes': 1136.1498947665082, 'read_bytes_disk': 0.0, 'write_bytes_disk': 16488222.596538935, 'num_fds': 15}, 'executing': {}, 'extensions': {'shuffle': {}}, 'reply': True}
2022-08-04 10:25:18,541 - distributed.core - DEBUG - Calling into handler heartbeat_worker
2022-08-04 10:25:18,541 - distributed.scheduler - DEBUG - Suggest closing workers: ['tls://172.16.1.74:33231']
2022-08-04 10:25:18,541 - dask_gateway.scheduler_preload - INFO - Requesting scale to 0 workers from 1
2022-08-04 10:25:18,541 - distributed.scheduler - DEBUG - Suggest closing workers: ['dask-worker-a2ad6eb25acf4958b9cd105edd1afaf4-z5k65']
2022-08-04 10:25:18,577 - distributed.scheduler - INFO - Retire worker names ['dask-worker-a2ad6eb25acf4958b9cd105edd1afaf4-z5k65']
2022-08-04 10:25:18,578 - distributed.scheduler - INFO - Retiring worker tls://172.16.1.74:33231
2022-08-04 10:25:18,578 - distributed.active_memory_manager - DEBUG - Running policy: RetireWorker('tls://172.16.1.74:33231')
2022-08-04 10:25:18,578 - distributed.active_memory_manager - DEBUG - Retiring <WorkerState 'tls://172.16.1.74:33231', name: dask-worker-a2ad6eb25acf4958b9cd105edd1afaf4-z5k65, status: closing_gracefully, memory: 0, processing: 0>
2022-08-04 10:25:18,578 - distributed.active_memory_manager - INFO - Retiring worker tls://172.16.1.74:33231; no unique keys need to be moved away.
2022-08-04 10:25:18,578 - distributed.active_memory_manager - DEBUG - Active Memory Manager run in 0ms
2022-08-04 10:25:18,578 - distributed.scheduler - DEBUG - All unique keys on worker tls://172.16.1.74:33231 have been replicated elsewhere
2022-08-04 10:25:18,578 - distributed.scheduler - INFO - Remove worker <WorkerState 'tls://172.16.1.74:33231', name: dask-worker-a2ad6eb25acf4958b9cd105edd1afaf4-z5k65, status: closing_gracefully, memory: 0, processing: 0>
2022-08-04 10:25:18,578 - distributed.core - INFO - Removing comms to tls://172.16.1.74:33231
2022-08-04 10:25:18,578 - distributed.scheduler - INFO - Lost all workers
2022-08-04 10:25:18,579 - distributed.scheduler - DEBUG - Removed worker <WorkerState 'tls://172.16.1.74:33231', name: dask-worker-a2ad6eb25acf4958b9cd105edd1afaf4-z5k65, status: closed, memory: 0, processing: 0>
2022-08-04 10:25:18,579 - distributed.scheduler - INFO - Retired worker tls://172.16.1.74:33231
2022-08-04 10:25:18,587 - distributed.core - DEBUG - Lost connection to 'tls://172.16.1.74:36674' while reading message: in <TLS (closed)  local=tls://172.16.0.68:8786 remote=tls://172.16.1.74:36674>: Stream is closed. Last operation: heartbeat_worker
2022-08-04 10:25:20,510 - distributed.core - DEBUG - Message from 'tls://172.16.0.27:41052': {'op': 'identity', 'reply': True}
2022-08-04 10:25:20,510 - distributed.core - DEBUG - Calling into handler identity
2022-08-04 10:25:22,510 - distributed.core - DEBUG - Message from 'tls://172.16.0.27:41052': {'op': 'identity', 'reply': True}
2022-08-04 10:25:22,511 - distributed.core - DEBUG - Calling into handler identity
2022-08-04 10:25:24,510 - distributed.core - DEBUG - Message from 'tls://172.16.0.27:41052': {'op': 'identity', 'reply': True}
2022-08-04 10:25:24,510 - distributed.core - DEBUG - Calling into handler identity
2022-08-04 10:25:26,510 - distributed.core - DEBUG - Message from 'tls://172.16.0.27:41052': {'op': 'identity', 'reply': True}
2022-08-04 10:25:26,511 - distributed.core - DEBUG - Calling into handler identity
2022-08-04 10:25:28,511 - distributed.core - DEBUG - Message from 'tls://172.16.0.27:41052': {'op': 'identity', 'reply': True}
2022-08-04 10:25:28,511 - distributed.core - DEBUG - Calling into handler identity
2022-08-04 10:25:30,511 - distributed.core - DEBUG - Message from 'tls://172.16.0.27:41052': {'op': 'identity', 'reply': True}
2022-08-04 10:25:30,511 - distributed.core - DEBUG - Calling into handler identity
2022-08-04 10:25:31,316 - distributed.scheduler - INFO - Remove client Client-1bde5a54-13df-11ed-8086-be04f20e3e24
2022-08-04 10:25:31,316 - distributed.scheduler - DEBUG - Client Client-1bde5a54-13df-11ed-8086-be04f20e3e24 releases keys: []
2022-08-04 10:25:31,316 - distributed.core - DEBUG - Lost connection to 'tls://172.16.0.27:41044' while reading message: in <TLS (closed)  local=tls://172.16.0.68:8786 remote=tls://172.16.0.27:41044>: Stream is closed. Last operation: subscribe_worker_status
2022-08-04 10:25:31,316 - distributed.scheduler - INFO - Remove client Client-1bde5a54-13df-11ed-8086-be04f20e3e24
2022-08-04 10:25:31,316 - distributed.scheduler - DEBUG - Finished handling client Client-1bde5a54-13df-11ed-8086-be04f20e3e24
2022-08-04 10:25:31,317 - distributed.scheduler - INFO - Close client connection: Client-1bde5a54-13df-11ed-8086-be04f20e3e24
2022-08-04 10:25:31,327 - distributed.core - DEBUG - Lost connection to 'tls://172.16.0.27:41052' while reading message: in <TLS (closed)  local=tls://172.16.0.68:8786 remote=tls://172.16.0.27:41052>: Stream is closed. Last operation: identity
2022-08-04 10:25:31,758 - distributed._signals - INFO - Received signal SIGTERM (15)
2022-08-04 10:25:31,758 - distributed.scheduler - INFO - Scheduler closing...
2022-08-04 10:25:31,759 - distributed.scheduler - INFO - Scheduler closing all comms
2022-08-04 10:25:31,759 - distributed.scheduler - INFO - Stopped scheduler at 'tls://172.16.0.68:8786'
2022-08-04 10:25:31,760 - distributed.scheduler - INFO - End scheduler

Failing one
If I start reading data from a local file or some remote location, or kick off some other work, and the JupyterLab (or whatever process started the read) dies before it finishes, the client dies but the scheduler keeps running forever, even though it is not doing anything.
I do get the message in the scheduler saying that the client is no longer connected, but that’s it.
As far as I know there is no other client connected, and there are no logs suggesting that another client is attached to that scheduler.

Here are the logs for this case: I started the process of reading data and killed the JupyterLab right after.

Logs from the scheduler

2022-08-04 09:16:31,075 - distributed.core - DEBUG - Calling into handler add_client
2022-08-04 09:16:31,075 - distributed.scheduler - INFO - Receive client connection: Client-20a6327c-13d6-11ed-805d-265a6f2be1a2
2022-08-04 09:16:31,077 - distributed.core - INFO - Starting established connection
2022-08-04 09:16:33,081 - distributed.core - DEBUG - Message from 'tls://172.16.0.27:42524': {'op': 'identity', 'reply': True}
2022-08-04 09:16:33,082 - distributed.core - DEBUG - Calling into handler identity
2022-08-04 09:16:35,081 - distributed.core - DEBUG - Message from 'tls://172.16.0.27:42524': {'op': 'identity', 'reply': True}
2022-08-04 09:16:35,082 - distributed.core - DEBUG - Calling into handler identity
2022-08-04 09:16:37,081 - distributed.core - DEBUG - Message from 'tls://172.16.0.27:42524': {'op': 'identity', 'reply': True}
2022-08-04 09:16:37,081 - distributed.core - DEBUG - Calling into handler identity
2022-08-04 09:16:39,081 - distributed.core - DEBUG - Message from 'tls://172.16.0.27:42524': {'op': 'identity', 'reply': True}
2022-08-04 09:16:39,081 - distributed.core - DEBUG - Calling into handler identity
2022-08-04 09:16:41,087 - distributed.core - DEBUG - Message from 'tls://172.16.0.27:42524': {'op': 'identity', 'reply': True}
2022-08-04 09:16:41,087 - distributed.core - DEBUG - Calling into handler identity
2022-08-04 09:16:43,083 - distributed.core - DEBUG - Message from 'tls://172.16.0.27:42524': {'op': 'identity', 'reply': True}
2022-08-04 09:16:43,083 - distributed.core - DEBUG - Calling into handler identity
2022-08-04 09:16:45,081 - distributed.core - DEBUG - Message from 'tls://172.16.0.27:42524': {'op': 'identity', 'reply': True}
2022-08-04 09:16:45,081 - distributed.core - DEBUG - Calling into handler identity
2022-08-04 09:16:47,081 - distributed.core - DEBUG - Message from 'tls://172.16.0.27:42524': {'op': 'identity', 'reply': True}
2022-08-04 09:16:47,081 - distributed.core - DEBUG - Calling into handler identity
2022-08-04 09:16:49,081 - distributed.core - DEBUG - Message from 'tls://172.16.0.27:42524': {'op': 'identity', 'reply': True}
2022-08-04 09:16:49,081 - distributed.core - DEBUG - Calling into handler identity
2022-08-04 09:16:51,084 - distributed.core - DEBUG - Message from 'tls://172.16.0.27:42524': {'op': 'identity', 'reply': True}
2022-08-04 09:16:51,084 - distributed.core - DEBUG - Calling into handler identity
2022-08-04 09:16:53,082 - distributed.core - DEBUG - Message from 'tls://172.16.0.27:42524': {'op': 'identity', 'reply': True}
2022-08-04 09:16:53,082 - distributed.core - DEBUG - Calling into handler identity
2022-08-04 09:16:54,219 - distributed.scheduler - DEBUG - Client Client-20a6327c-13d6-11ed-805d-265a6f2be1a2 releases keys: ["('get-partition-0-read-csv-e7e3084d9f682d32b4f1a6fd5e542597', 0)"]
2022-08-04 09:16:55,088 - distributed.core - DEBUG - Message from 'tls://172.16.0.27:42524': {'op': 'identity', 'reply': True}
2022-08-04 09:16:55,088 - distributed.core - DEBUG - Calling into handler identity
2022-08-04 09:16:56,707 - distributed.core - DEBUG - Lost connection to 'tls://172.16.0.27:42510' while reading message: in <TLS (closed)  local=tls://172.16.0.45:8786 remote=tls://172.16.0.27:42510>: Stream is closed. Last operation: subscribe_worker_status
2022-08-04 09:16:56,707 - distributed.scheduler - INFO - Remove client Client-20a6327c-13d6-11ed-805d-265a6f2be1a2
2022-08-04 09:16:56,708 - distributed.scheduler - DEBUG - Client Client-20a6327c-13d6-11ed-805d-265a6f2be1a2 releases keys: []
2022-08-04 09:16:56,708 - distributed.scheduler - DEBUG - Finished handling client Client-20a6327c-13d6-11ed-805d-265a6f2be1a2
2022-08-04 09:16:56,708 - distributed.core - DEBUG - Lost connection to 'tls://172.16.0.27:42524' while reading message: in <TLS (closed)  local=tls://172.16.0.45:8786 remote=tls://172.16.0.27:42524>: Stream is closed. Last operation: identity
2022-08-04 09:16:56,708 - distributed.scheduler - INFO - Close client connection: Client-20a6327c-13d6-11ed-805d-265a6f2be1a2

Also an image showing the scheduler running:

Can someone help me with this, please? :bowing_man:

Oh, and we tried the idle timeout. It kind of works, but sometimes we have processes that run for, let’s say, a day, and they don’t use Dask or interact with the scheduler for some stretch of time (we don’t know how long; it depends on the data), so picking that timeout value is a shot in the dark.
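For reference, the idle timeout we experimented with is configured on the gateway server side. A sketch of what that looks like, assuming the `ClusterConfig.idle_timeout` setting (check the dask-gateway docs for your version, as the exact traitlet name may differ):

```python
# dask_gateway_config.py (gateway server side) -- hedged sketch
# Shut down clusters after one hour without scheduler activity.
c.ClusterConfig.idle_timeout = 3600  # seconds
```

As described above, any fixed value here is a gamble when workloads go quiet for an unpredictable amount of time.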

Cheers :beers:


@portellaa Welcome to Discourse! I’ll keep looking into this; I’d also encourage you to share this on the dask-gateway issue tracker to reach the maintainers directly.

Hello @pavithraes,

Thank you. I think I forgot to introduce myself :see_no_evil: I will do that in the meantime.

I wanted to understand first whether it was a problem on my side, but since it sometimes works and sometimes doesn’t, I’m not sure :sweat_smile:

But thanks, I will post it on GitHub then.

Thanks