Using Dask SSH Cluster

I am exploring different ways of using multi-node (multi-host) clusters in Dask. The SSH option caught my eye. I am using 2 Windows machines for now (and will eventually extend to more). I set up a public-key SSH connection between node 0 and node 1. My first attempt was to set up the scheduler on node 0 and configure both node 0 and node 1 to each create “n” workers. My first question is whether it would be possible to configure different nodes within the cluster to each start a different number of workers?

Initially I tried as follows:

    from dask.distributed import SSHCluster

    num_workers = 6

    # First host runs the scheduler, the remaining hosts run workers
    cluster = SSHCluster(
        ["localhost", "localhost", "NODE-FDQJ136P"],
        connect_options={"known_hosts": None},
        worker_options={"n_workers": num_workers},
        scheduler_options={"port": 0, "dashboard_address": ":8797"},
    )

However, I got an error message indicating that SSH to localhost was not possible. This seemed weird at first: why would localhost need to SSH into localhost? Upon further thought, it seems plausible that the scheduler runs in a separate process, and SSHCluster would still need to SSH into localhost to start the worker processes even though they are on the same machine. I spent some time trying to configure localhost to connect via SSH to itself. I had trouble achieving this because I was using a domain user, so I created a local admin user on my PC, and this allowed me to SSH from localhost to localhost.
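(As an aside, on the per-host worker-count question: one way that should work with the plain SSHCluster constructor is to repeat a hostname in the hosts list, since every entry after the first starts its own worker on that host; the documented examples repeat “localhost” this way. A sketch along those lines, hedged because I haven’t tried it across Windows machines, with illustrative counts:)

    from dask.distributed import SSHCluster

    # First entry is the scheduler host; every following entry starts one
    # worker on that host, so repeating a hostname gives it more workers.
    # Here: 2 workers on localhost, 1 worker on NODE-FDQJ136P.
    cluster = SSHCluster(
        ["localhost", "localhost", "localhost", "NODE-FDQJ136P"],
        connect_options={"known_hosts": None},
        scheduler_options={"port": 0, "dashboard_address": ":8797"},
    )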

My second attempt was the same as the previous one, but with the “NODE-FDQJ136P” node removed from the host array, i.e. I am trying to start both the scheduler and 6 workers on localhost. Now I am getting this error message, which I couldn’t find much information about:

    Cluster failed to start: Scheduler failed to set DASK_INTERNAL_INHERIT_CONFIG variable

From the documentation, it seems that this variable is used to copy the local Dask config to other nodes. The user I am logged in with is a local admin and has access to the system environment variables, but I have no further ideas about what to try.

Any suggestions would be appreciated.

In case anyone ever bumps into this: the issue I was encountering seemed to be the same as reported here. Sure enough, the suggested fix allows me to get past this step. Unfortunately, however, soon after, I ran into an issue with running a command which is “too long” when serializing the config and running a command based on it (presumably to recreate the same config on the “remote” node). Reported as an issue here, and currently awaiting feedback.
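For anyone wondering whether their config is what pushes the command over the limit: the config is handed to the remote processes as a single serialized string (which is what ends up in DASK_INTERNAL_INHERIT_CONFIG), so its size can be inspected locally. A rough sketch, assuming the serialization helper in dask.config is the one used:

    import dask.config

    # Serialize the current in-memory Dask config the same way it is passed
    # to remote processes via DASK_INTERNAL_INHERIT_CONFIG.
    serialized = dask.config.serialize(dask.config.global_config)

    # cmd.exe limits a command line to 8191 characters, so a large config
    # can easily push the generated remote command over that limit.
    print(len(serialized))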

Hi @jurgencuschieri, welcome to Dask community!

Glad you found this first fix!

The Windows problem link you mention in the issue doesn’t specify Windows 11 as affected, but I guess if you’re encountering it, it is. I guess there’s nothing more we can say here: you identified the problem, and it needs a fix. Or maybe you would be willing to contribute?

Hi Guillaume,

Thanks for getting back to me. I was also surprised that it happened to me with Windows 10; indeed, Windows 10 is not mentioned in the Microsoft link. I hadn’t considered trying to fix the issue in the library myself, to be honest, simply because it is not something I have done before. I was in fact considering installing a Linux VM and working with Linux instead. If I were to consider contributing, what alternative approach would you suggest that does not depend on serializing the config into the command line? I would of course have to understand the underlying implementation, but I might consider it!

Well, I think I would consider the first workaround from your link:

Modify programs that require long command lines so that they use a file that contains the parameter information, and then include the name of the file in the command line.

I would serialize the config to a file, then scp this file to every server. Not sure if this can be done easily, though.
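Roughly, the idea would be something like this (just a sketch; the file name is made up, and the copy-to-remote step would still need to be wired into the SSH commands that SSHCluster runs):

    import yaml
    import dask.config

    # Dump the current config to a YAML file instead of putting it on the
    # command line. "dask-inherited-config.yaml" is just a placeholder name.
    with open("dask-inherited-config.yaml", "w") as f:
        yaml.safe_dump(dask.config.global_config, f)

    # The file would then have to be copied to each host (e.g. with scp or
    # asyncssh, which SSHCluster already uses for its connections) into a
    # location the remote Dask processes read config from, such as their
    # dask config directory.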

I gave this a shot and possibly managed to get a bit further. In the hosts param of the SSHCluster I am specifying “localhost” and “localhost”, for the scheduler and worker, respectively. When calling the “await super().start()” coroutine from the “start” function of the “ssh.py” module, I am getting the error “missing port number in address ‘localhost’” on this line:

    self.scheduler_comm = rpc(
        getattr(self.scheduler, "external_address", None)
        or self.scheduler.address,
        connection_args=self.security.get_connection_args("client"),
    )

where self.scheduler.address is “localhost”. Any idea what I might be missing?

I suspect none of this has been thoroughly tested on Windows. My guess is that Windows expects you to explicitly specify port 22 so we may need to include that.
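Something along those lines might look like this (just a guess; connect_options is passed through to asyncssh.connect, which does accept a port keyword):

    from dask.distributed import SSHCluster

    cluster = SSHCluster(
        ["localhost", "localhost"],
        # Make the SSH port explicit; asyncssh.connect accepts "port".
        connect_options={"known_hosts": None, "port": 22},
        worker_options={"n_workers": 6},
        scheduler_options={"port": 0, "dashboard_address": ":8797"},
    )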
