Using Dask SSH Cluster

I am exploring different ways of using multi-node (multi-host) clusters in Dask. The SSH option caught my eye. I am using 2 Windows machines for now (and will eventually extend to more). I set up a public-key SSH connection between node 0 and node 1. My first attempt was to set up the scheduler on node 0 and configure both node 0 and node 1 to each create “n” workers. My first question is whether it would be possible to configure different nodes within the cluster to each start a different number of workers?

Initially I tried as follows:

    from dask.distributed import SSHCluster

    num_workers = 6

    # First host runs the scheduler, the remaining hosts run workers
    cluster = SSHCluster(
        ["localhost", "localhost", "NODE-FDQJ136P"],
        connect_options={"known_hosts": None},
        worker_options={"n_workers": num_workers},
        scheduler_options={"port": 0, "dashboard_address": ":8797"},
    )

However, I got an error message indicating that SSH to localhost was not possible. This seemed weird at first: why would localhost need to SSH into localhost? Upon further thought, it seems plausible that the scheduler runs in a separate process, and SSHCluster would still need to SSH into localhost to start the worker processes even though they are on the same machine. I spent some time trying to configure localhost to connect via SSH to itself. I had trouble achieving this because I was using a domain user, so I created a local admin user on my PC, and this allowed me to SSH from localhost to localhost.
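(As an aside, on the per-host worker-count question: one way that should work with the plain SSHCluster constructor is to repeat a hostname in the hosts list, since every entry after the first starts its own worker on that host; the documented examples repeat “localhost” this way. A sketch along those lines, hedged because I haven’t tried it across Windows machines, with illustrative counts:)

    from dask.distributed import SSHCluster

    # First entry is the scheduler host; every following entry starts one
    # worker on that host, so repeating a hostname gives it more workers.
    # Here: 2 workers on localhost, 1 worker on NODE-FDQJ136P.
    cluster = SSHCluster(
        ["localhost", "localhost", "localhost", "NODE-FDQJ136P"],
        connect_options={"known_hosts": None},
        scheduler_options={"port": 0, "dashboard_address": ":8797"},
    )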

My second attempt was the same as the previous one, but with the “NODE-FDQJ136P” node removed from the host array, i.e. I am trying to start both the scheduler and 6 workers on localhost. Now I am getting this error message, which I couldn’t find much information about:

    Cluster failed to start: Scheduler failed to set DASK_INTERNAL_INHERIT_CONFIG variable

From the documentation, it seems that this variable is used to copy the local Dask config to other nodes. The user I am logged in with is a local admin and has access to the system environment variables, but I have no further ideas about what to try.

Any suggestions would be appreciated.

In case anyone ever bumps into this: the issue I was encountering seemed to be the same as reported here. Sure enough, the suggested fix allows me to get past this step. Unfortunately, however, soon after, I ran into an issue with running a command which is “too long” when serializing the config and running a command based on it (presumably to recreate the same config on the “remote” node). Reported as an issue here, and currently awaiting feedback.
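For anyone wondering whether their config is what pushes the command over the limit: the config is handed to the remote processes as a single serialized string (which is what ends up in DASK_INTERNAL_INHERIT_CONFIG), so its size can be inspected locally. A rough sketch, assuming the serialization helper in dask.config is the one used:

    import dask.config

    # Serialize the current in-memory Dask config the same way it is passed
    # to remote processes via DASK_INTERNAL_INHERIT_CONFIG.
    serialized = dask.config.serialize(dask.config.global_config)

    # cmd.exe limits a command line to 8191 characters, so a large config
    # can easily push the generated remote command over that limit.
    print(len(serialized))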

Hi @jurgencuschieri, welcome to Dask community!

Glad you found this first fix!

The Windows problem link you mention in the issue doesn’t specify Windows 11 as affected, but I guess if you’re encountering it, it is. I guess there’s nothing more we can say here: you identified the problem, and it needs a fix. Or maybe you would be willing to contribute?

Hi Guillaume,

Thanks for getting back to me. I was also surprised that it happened to me with Windows 10; indeed, Windows 10 is not mentioned in the Microsoft link. I hadn’t considered trying to fix the issue in the library myself, to be honest, simply because it is not something I have done before. I was in fact considering installing a Linux VM and working with Linux instead. If I were to consider contributing, what alternative approach would you suggest that does not depend on serializing the config into the command line? I would of course have to understand the underlying implementation, but I might consider it!

Well, I think I would consider the first workaround from your link:

Modify programs that require long command lines so that they use a file that contains the parameter information, and then include the name of the file in the command line.

I would serialize the config to a file, then scp this file to every server. Not sure if this can be done easily, though.
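Roughly, the idea would be something like this (just a sketch; the file name is made up, and the copy-to-remote step would still need to be wired into the SSH commands that SSHCluster runs):

    import yaml
    import dask.config

    # Dump the current config to a YAML file instead of putting it on the
    # command line. "dask-inherited-config.yaml" is just a placeholder name.
    with open("dask-inherited-config.yaml", "w") as f:
        yaml.safe_dump(dask.config.global_config, f)

    # The file would then have to be copied to each host (e.g. with scp or
    # asyncssh, which SSHCluster already uses for its connections) into a
    # location the remote Dask processes read config from, such as their
    # dask config directory.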

I gave this a shot and possibly managed to get a bit further. In the hosts param of the SSHCluster I am specifying “localhost” and “localhost”, for the scheduler and worker, respectively. When calling the “await super().start()” coroutine from the “start” function of the “ssh.py” module, I am getting the error “missing port number in address ‘localhost’” on this line:

    self.scheduler_comm = rpc(
        getattr(self.scheduler, "external_address", None)
        or self.scheduler.address,
        connection_args=self.security.get_connection_args("client"),
    )

where self.scheduler.address is “localhost”. Any idea what I might be missing?

I suspect none of this has been thoroughly tested on Windows. My guess is that Windows expects you to explicitly specify port 22 so we may need to include that.
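Something along those lines might look like this (just a guess; connect_options is passed through to asyncssh.connect, which does accept a port keyword):

    from dask.distributed import SSHCluster

    cluster = SSHCluster(
        ["localhost", "localhost"],
        # Make the SSH port explicit; asyncssh.connect accepts "port".
        connect_options={"known_hosts": None, "port": 22},
        worker_options={"n_workers": 6},
        scheduler_options={"port": 0, "dashboard_address": ":8797"},
    )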
