Gather via persistent worker connection from scheduler?

RaiinmakerWes · June 28, 2024, 7:42pm

This is related to the thread “Workers on private network, scheduler on a different network - how to make “gather” step work?”, but I’d like to explore a potential solution rather than necro-bumping an old thread.

I have a similar setup with a centralized scheduler running in a cloud provider, and a user-distributed script to run workers at will on arbitrary private computers. I’ve solved for the cloud ingress/egress, so workers are able to successfully connect to the scheduler to self-identify, and we are able to see the heartbeat calls inside of the scheduler.
When a client connects and submits a job, the workers are able to receive and perform the tasks successfully, and I see the debug logs in the scheduler indicating the task completion. The problem is again at gather()… The scheduler seems to always attempt to initiate a new TCP connection to the worker address, which gets blocked by the private workers’ various and untouchable-for-our-purposes firewalls.

My question is this - is there any way to persist a TCP or WS connection initiated by the worker, and leverage that communication channel to gather the results? This is potentially a deal-breaker for us if we can’t solve the gather from the centralized scheduler.

guillaumeeb · July 4, 2024, 2:46pm

Hi @RaiinmakerWes, welcome to Dask Discourse forum!

As mentioned by @crusaderky in the post you linked, this does not seem to be how Dask communications work. But maybe there is more to it?

RaiinmakerWes · July 8, 2024, 5:51pm

Yeah, it seems like that sort of persistent connection is an anti-pattern, so I stopped chasing that down…

I’m now working with ngrok TCP Endpoints to create TCP tunnels. I’m running into mismatches between the worker contact-address I provide and the address the scheduler uses to create the new TCP connection in the gather step. But I may start a new topic for that since it isn’t directly related to this thread.

RaiinmakerWes · July 9, 2024, 5:19pm

Got it working. I wasn’t hooking up contact-address and listen-address properly with the ngrok TCP tunnel.
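
For readers hitting the same problem, here is a minimal sketch of how the two addresses might be wired together. It is not the final script mentioned in this thread, only an illustration under stated assumptions: the worker listens on a local port that an ngrok TCP tunnel forwards to, and advertises the tunnel's public endpoint as its contact address so the scheduler dials the tunnel during gather. The helper name `build_worker_command`, the scheduler URL, and the ngrok host and port are all hypothetical placeholders; `--listen-address` and `--contact-address` are the `dask worker` CLI options being referred to.

```python
import shlex


def build_worker_command(scheduler_address, local_port, tunnel_host, tunnel_port):
    """Build a `dask worker` command line for a worker behind a TCP tunnel.

    Hypothetical helper for illustration only.
    - listen address: where the worker actually binds on the private machine
    - contact address: the public tunnel endpoint the scheduler should dial
    """
    listen = f"tcp://0.0.0.0:{local_port}"
    contact = f"tcp://{tunnel_host}:{tunnel_port}"
    return [
        "dask", "worker", scheduler_address,
        "--listen-address", listen,
        "--contact-address", contact,
    ]


# Placeholder values: assume a tunnel started separately (e.g. `ngrok tcp 9000`)
# reported the public endpoint 0.tcp.ngrok.io:12345.
cmd = build_worker_command(
    "tcp://scheduler.example.com:8786", 9000, "0.tcp.ngrok.io", 12345
)
print(shlex.join(cmd))
```

The key point is that the port in the listen address must match the local port the tunnel forwards to, while the contact address must match the tunnel's public host and port exactly; a mismatch on either side would likely reproduce the failed connection at gather.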

guillaumeeb · July 11, 2024, 3:28pm

Nice, it would be good if you could share your solution!

RaiinmakerWes · July 18, 2024, 6:04pm

For sure!
I am working on productizing the final script, then I will absolutely make a follow-up post to show a minimal reproducible solution.
