At CERN we have a Jupyter notebook service that we are now integrating with HTCondor resources, and we would like to use those resources via Dask.
The setup is the following: users log in to the notebook service and get a user session, which runs in a Docker container. Inside their session, users should be able to create a Dask HTCondorCluster to deploy Dask workers on our HTCondor pool. The problem we have is that the address the scheduler binds to can’t be the same as the address the workers use to contact the scheduler. The scheduler runs inside the container and should listen on an address:port of the container's private network. However, the workers (which run in another network, in the HTCondor pool) should contact the scheduler on an address:port of the node that hosts the user container, for which we would set up port forwarding to reach the container.
So far we haven’t found any way for the workers to receive a different scheduler address than the address the scheduler binds to. We found this:
but that only allows specifying a different address for the client to contact the scheduler (i.e. the scheduler must still bind to the same address that the workers receive).
What would be the way to configure a setup like the one I just described and make it possible for workers to connect to the scheduler?
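To make the two addresses in the setup above concrete, here is a minimal sketch in plain Python. It does not use any Dask API. The function name `contact_address`, the forwarding table `FORWARDS`, and all IPs, host names and ports are hypothetical, chosen only for illustration. It shows the translation we would need: the scheduler binds to an address on the container's private network, and the workers should be told the forwarded address on the hosting node instead.

```python
from urllib.parse import urlsplit

# Hypothetical port-forwarding table (an assumption for illustration):
# (container IP, container port) -> (hosting node name, forwarded port)
FORWARDS = {
    ("172.17.0.2", 8786): ("notebook-node.example.org", 30786),
}


def contact_address(bind_address: str, forwards: dict) -> str:
    """Return the address workers outside the container should use.

    bind_address is the address the scheduler listens on inside the
    container, e.g. "tcp://172.17.0.2:8786".
    """
    parts = urlsplit(bind_address)
    key = (parts.hostname, parts.port)
    if key not in forwards:
        # No forwarding rule: workers would have to reach the bind address directly.
        return bind_address
    host, port = forwards[key]
    return f"{parts.scheme}://{host}:{port}"


bind = "tcp://172.17.0.2:8786"  # what the scheduler binds to
print(contact_address(bind, FORWARDS))  # what the workers should receive
```

The missing piece is a way to hand the second address to the workers while the scheduler keeps listening on the first one.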
Thank you for sharing, @oshadura, but I believe that patch is not strictly related to the issue I described above (I’d like to find a way for workers to be told a different scheduler address than the one the scheduler binds to).
Hi @etejedor and welcome to discourse! At the moment, dask-jobqueue unfortunately doesn’t support this, but I would recommend opening an issue there. Depending on your notebook server environment, you might be able to use batchspawner. Using dask-gateway might be another option as well.
Hi @scharlottej13, thank you for your reply and the suggestions. I think opening an issue is probably the best option. Batchspawner is not an option for us since we use the k8s spawner (the notebook servers don’t run on the HTCondor cluster). dask-gateway is certainly something to keep an eye on, but I think a simpler setup would be better to start with.
Would it be better to open the issue on dask-jobqueue or on distributed? The necessary changes would likely imply a new parameter for the scheduler.
@etejedor I’d suggest starting the discussion on dask-jobqueue, and perhaps opening a follow-up issue on distributed later based on it – what do you think?
@etejedor Thanks for opening that issue! I think we can continue the discussion there (to avoid duplication), so I’ll mark this Discourse thread as resolved.