Unable to start SSHCluster

Hi everyone,

I’m having trouble deploying my first SSHCluster. I have a tool that uses Dask and PDAL to process point clouds, and I want to use Dask’s distributed features to take advantage of the cores of a compute station.

First, I wanted to deploy a LocalCluster on the station, running as a Windows service, and connect to its scheduler from my own computer to launch work that would run on the station. But this solution doesn’t work, because I’m unable to connect to this LocalCluster from another machine even though both are on the same network.

So I wanted to deploy an SSHCluster on my own machine with these parameters:

from distributed import SSHCluster

cluster = SSHCluster(hosts=['localhost', 'ip of my station'])

With these parameters, the scheduler will run on my computer and I will be able to deploy many workers on my station. But when I run my code, I get this error:

RuntimeError: Cluster failed to start: Multiple exceptions: [Errno 10061] Connect call failed ('127.0.0.1', 22), [Errno 10061] Connect call failed ('::1', 22, 0, 0)

I really don’t know why it’s not working. I searched for a while and I still don’t know why this error occurs.

Can someone help me?

Regards,

Clément

@ClementAlba Thanks for the question. I’m not able to reproduce this, but I’ll keep looking into it. Maybe @jacobtomlinson or @graingert have thoughts?

I’m also curious about:

First, I wanted to deploy a LocalCluster on the station, running as a Windows service, and connect to its scheduler from my own computer to launch work that would run on the station. But this solution doesn’t work, because I’m unable to connect to this LocalCluster from another machine even though both are on the same network.

LocalCluster isn’t the best practice here, and SSHCluster is the better approach. However, I would’ve expected LocalCluster to work. Could you please share details about why you couldn’t connect to it, like the error traceback? I’m asking because it might be related to the issue you’re facing with SSHCluster.

Hi @pavithraes and thanks for your response.

For the LocalCluster:
I created a conda environment on the compute station with Dask and IPython installed. Then, in IPython, I ran this code:

from distributed import LocalCluster
cluster = LocalCluster(n_workers=3, threads_per_worker=1)

Then I got the scheduler address with cluster.scheduler_address.

On my personal laptop, I wrote a very simple Python script to do some work:

from distributed import Client
import dask


def inc(x):
    return x + 1


def double(x):
    return x * 2


def add(x, y):
    return x + y


if __name__ == '__main__':
    client = Client(address='address of my scheduler running on the station')
    print(client)

    data = range(500)
    output = []
    for d in data:
        a = dask.delayed(inc)(d)
        b = dask.delayed(double)(d)
        c = dask.delayed(add)(a, b)
        output.append(c)

    total = dask.delayed(sum)(output)
    total.compute()
    client.shutdown()

And when I run it, I get the following error:

OSError: Timed out trying to connect to tcp://127.0.0.1:port after 30 s
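
A note on the address in that error: 127.0.0.1 is a loopback address, so from the laptop it refers to the laptop itself rather than the station. The sketch below is a small standard-library check for that situation. The function name `is_loopback_scheduler` and the example addresses are made up for illustration and are not part of Dask.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_scheduler(address: str) -> bool:
    """Return True if a Dask-style address (e.g. 'tcp://127.0.0.1:8786')
    points at the local machine only."""
    host = urlparse(address).hostname
    if host is None:
        raise ValueError(f"no host found in {address!r}")
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not a literal IP address (a DNS name); this sketch does not resolve it.
        return False


# Hypothetical usage on the station, after creating the cluster:
#     print(is_loopback_scheduler(cluster.scheduler_address))
```

If this returns True for `cluster.scheduler_address`, other machines cannot reach the scheduler through that address. One thing to try, as an assumption about this setup rather than a confirmed fix, is passing the station's network IP through the `host` argument of LocalCluster, which otherwise listens on localhost only.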

All of these errors seem to suggest that something is blocking the connection between your laptop and the station on those ports. Are there any firewalls running or network configuration in your environment that would prevent this?
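
To narrow down where the connection is blocked, a plain TCP check from the laptop can help. This is a minimal sketch using only the standard library. `can_connect` is a hypothetical helper, and the host and port values in the usage comments are placeholders to replace with your own.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Hypothetical usage from the laptop:
#     can_connect("localhost", 22)            # is an SSH server listening locally?
#     can_connect("<ip of my station>", 22)   # is SSH reachable on the station?
#     can_connect("<ip of my station>", 8786) # is the scheduler port reachable?
```

A False result for `localhost` on port 22 would match the "Connect call failed ('127.0.0.1', 22)" message above, since SSHCluster needs an SSH server on every host it is given, including the first one.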

@jacobtomlinson Yes, there is SSL on my network, so the self-generated certificate may be blocking the connection. SSL may also be blocking some ports.

In that case I suggest you reach out to whoever manages your network for help here.

By default, Dask needs port 8786 open for the client to connect to the scheduler, and port 8787 for the dashboard. All scheduler-worker and worker-worker communication happens on random high ports, so those must not be blocked either.

All of this is configurable if there are specific port ranges you are meant to be using on your network.
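
As a rough illustration of pinning the scheduler ports, here is a sketch that builds the arguments for SSHCluster. The helper names, the host values, and the port numbers are placeholders for illustration. The `scheduler_options` keys `port` and `dashboard_address` follow the SSHCluster documentation, but check them against your installed version.

```python
def ssh_cluster_kwargs(scheduler_host, worker_hosts,
                       scheduler_port=8786, dashboard_port=8787):
    """Build keyword arguments for SSHCluster with fixed scheduler ports.

    The first entry in `hosts` runs the scheduler; the rest run workers.
    """
    return {
        "hosts": [scheduler_host, *worker_hosts],
        "scheduler_options": {
            "port": scheduler_port,
            "dashboard_address": f":{dashboard_port}",
        },
    }


def start_cluster(kwargs):
    """Start the cluster. Needs `distributed` installed and SSH access to
    every host, so it is not called in this sketch."""
    from distributed import SSHCluster
    return SSHCluster(**kwargs)


# Hypothetical usage:
#     kwargs = ssh_cluster_kwargs("localhost", ["<ip of my station>"])
#     cluster = start_cluster(kwargs)
```

Worker ports can be restricted in a similar way, but the exact option names depend on how the workers are launched, so that part is left out here.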