Dask scheduler in a docker container, workers as HTCondor jobs

Hello,

At CERN we have a Jupyter notebook service that we are now integrating with HTCondor resources, and we would like to use those resources via Dask.

The setup is the following: users log in to the notebook service and get a user session, which runs in a Docker container. Inside their session, users should be able to create a Dask HTCondorCluster to deploy Dask workers on our HTCondor pool. The problem we have is that the address the scheduler binds to can’t be the same as the address the workers use to contact it. The scheduler runs inside the container and should listen on an address:port of the container’s private network. However, the workers (which run in another network, in the HTCondor pool) should contact the scheduler on an address:port of the node that hosts the user container, for which we would set up port forwarding to reach the container.
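To make this concrete, here is a minimal sketch of what we would run in the notebook; all addresses and ports below are placeholders, and `scheduler_options` is simply the existing dask-jobqueue mechanism for choosing the scheduler’s bind address:

```python
# Minimal sketch of the current setup (all addresses/ports are placeholders).
# The scheduler is started inside the user's container and binds to the
# container's private network via scheduler_options.
from dask_jobqueue import HTCondorCluster

cluster = HTCondorCluster(
    cores=1,
    memory="2 GiB",
    disk="1 GiB",
    # Bind address/port on the container's private network (placeholder values)
    scheduler_options={"host": "172.17.0.2", "port": 8786},
)
cluster.scale(4)  # submits one HTCondor job per worker

# Problem: the worker jobs are handed this same container-internal address
# (tcp://172.17.0.2:8786). What they would need instead is the address of the
# node hosting the container plus a forwarded port, e.g.
# tcp://<hosting-node>.cern.ch:30786, which we would make reachable via port
# forwarding on the host. There is currently no option to tell dask-jobqueue
# to advertise that address to the workers.
```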

So far we haven’t found any way for the workers to receive a different scheduler address than the address the scheduler binds to. We found this:

but that only allows specifying a different address for the client to contact the scheduler (i.e. the scheduler must still bind to the same address that the workers receive).

What would be the way to configure a setup like the one I just described and make it possible for workers to connect to the scheduler?

Thank you,

Enric

Hi @etejedor, we have been working on a similar setup for an analysis facility at the University of Nebraska-Lincoln (cc @bbockelm), and I believe you could be interested in my PR Preserve worker hostname by oshadura · Pull Request #4938 · dask/distributed · GitHub; I will try to wrap it up next week.

Thank you for sharing, @oshadura, but I believe that patch is not strictly related to the issue I described above (I’d like to find a way for the workers to be told a different scheduler address than the one the scheduler binds to).

Hi @etejedor and welcome to Discourse! At the moment dask-jobqueue unfortunately doesn’t support this, but I would recommend opening an issue there. Depending on your notebook server environment, you might be able to use batchspawner. Using dask-gateway might be another option as well.
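For completeness, the dask-gateway route would look roughly like the sketch below from the notebook side. The gateway URL is a placeholder, and it assumes a gateway deployment whose backend can launch Dask clusters on the HTCondor pool; the scheduler would then be started by the gateway rather than inside the user container, and the client connects through the gateway’s proxy:

```python
# Rough sketch of the dask-gateway alternative (the URL is a placeholder, and
# this assumes a gateway deployment whose backend launches Dask clusters on
# the HTCondor pool). The scheduler is started by the gateway rather than
# inside the user's container, and the client connects via the gateway proxy.
from dask_gateway import Gateway

gateway = Gateway("http://dask-gateway.example.cern.ch")  # placeholder URL
cluster = gateway.new_cluster()
cluster.scale(4)
client = cluster.get_client()
```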

@etejedor I think we also have another patch that could be useful; I will open it as a PR next week: https://github.com/CoffeaTeam/coffea-casa/blob/master/docker/coffea-casa-cc7/distributed/0004-Add-possibility-to-setup-external_adress-for-schedul.patch

Hi @scharlottej13, thank you for your reply and the suggestions; I think opening an issue is probably the best option. Batchspawner is not an option for us since we use the Kubernetes spawner (the notebook servers don’t run on the HTCondor cluster), and dask-gateway is certainly something to keep an eye on, but I think a simpler setup would be better to start with.

Would it be better to open the issue on dask-jobqueue or on distributed? The necessary changes would likely imply a new parameter for the scheduler.

Thanks @oshadura, that looks like a possible solution!

It will probably need to be adapted to avoid a clash with Add support for separate external address for SpecCluster scheduler by jacobtomlinson · Pull Request #2963 · dask/distributed · GitHub, which already defines an external_address option with a different meaning (i.e. the address the client uses to connect to the scheduler, not the address the workers use, as in your patch).
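For reference, the string that a patch along these lines would need to replace is the scheduler address that dask-jobqueue currently embeds into the generated worker job; a quick way to inspect it (cluster arguments and addresses below are placeholders):

```python
# Sketch: inspect the job that dask-jobqueue would submit for each worker.
# The worker command in it contains the scheduler address the workers will try
# to contact; today that is the scheduler's bind address, which in our setup
# is the container-internal one. (Arguments are placeholders.)
from dask_jobqueue import HTCondorCluster

cluster = HTCondorCluster(
    cores=1,
    memory="2 GiB",
    disk="1 GiB",
    scheduler_options={"host": "172.17.0.2", "port": 8786},  # placeholder bind address
)
print(cluster.scheduler_address)  # tcp://172.17.0.2:8786
print(cluster.job_script())       # the worker command in it uses this same address
```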

@etejedor I’d suggest starting the discussion on dask-jobqueue, and perhaps opening a follow-up issue on distributed later based on it – what do you think?

That works for me, let me just ping @oshadura since she said she’s going to open a PR with her patch.

This is now tracked in Configure Dask workers to contact scheduler on a specific address · Issue #548 · dask/dask-jobqueue · GitHub

@etejedor Thanks for opening that issue! I think we can continue the discussion there (to avoid duplication), so I’ll mark this Discourse thread as resolved. :smile:
