[Best practice] Deploy a cluster on an interactive compute node on a Slurm cluster

Hi everyone and thanks for the nice opportunity to post questions in this forum!

I am working on a Slurm cluster and, for controllability and testing reasons, I want to set up a Dask cluster from within an interactive session on a single compute node of the Slurm cluster. Therefore, neither dask_jobqueue.SLURMCluster nor an MPI approach seems to be an option (please correct me if I got that wrong). Since I am already on one compute node (i.e. one machine), I attempted to deploy a LocalCluster instead.
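Roughly, this is what I am running (a minimal sketch of my actual script; the surrounding code and the concrete computation are left out):

```python
from dask.distributed import Client, LocalCluster

# Spawn one single-threaded worker process per core on the reserved node.
cluster = LocalCluster(n_workers=128, processes=True, threads_per_worker=1)

# Connect a client (the rest of the script uses this for the actual computation).
client = Client(cluster)
```

This leads to the following chain of errors: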
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x148f08e43760>>, <Task finished name='Task-12' coro=<Worker.heartbeat() done

exception=OSError('Timed out during handshake while connecting to tcp://127.0.0.1:32805 after 30 s')

asyncio.exceptions.CancelledError

asyncio.exceptions.TimeoutError

OSError: Timed out during handshake while connecting to tcp://127.0.0.1:32805 after 30 s.

I tried to post only the (seemingly) important parts of the error output, but I suspect there might already be a good solution to the problem based on the description of what I want to achieve. I apologise if I simply didn’t find the right description of how to approach this scenario in the documentation.

Thank you very much for your assistance and I am happy to provide more information.

@verakye I see this was your first question, welcome to Discourse!

I am working on a Slurm cluster and, for controllability and testing reasons, I want to set up a Dask cluster from within an interactive session on a single compute node of the Slurm cluster. Therefore, neither dask_jobqueue.SLURMCluster nor an MPI approach seems to be an option (please correct me if I got that wrong). Since I am already on one compute node (i.e. one machine), I attempted to deploy a LocalCluster instead.

@bryanweber and I discussed this, and we think this is alright for testing purposes but may not be recommended for production environments.

OSError: Timed out during handshake while connecting to tcp://127.0.0.1:32805 after 30 s.

A few different things can cause this error. Would you mind sharing the complete error traceback and the dask/distributed versions you’re running? It’ll allow us to help you better.
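For example, something along these lines (run in the same environment you start the LocalCluster from) would give us the version information; it’s just a quick sketch:

```python
import dask
import distributed

# Report the installed versions of dask and distributed.
print("dask:", dask.__version__)
print("distributed:", distributed.__version__)

# If a Client does manage to connect, this also compares the versions
# seen by the scheduler and the workers.
# print(client.get_versions(check=True))
```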

Some rough thoughts:

  • The way you are connecting to the cluster (SSH, Jupyter, etc.) might affect the scheduler-client connection and the LocalCluster setup, so it’d be great if you could share more details about this as well.
  • Can you safely use 128 cores on the cluster – are the resources available?
  • Dask is trying to connect to localhost, is that accessible from your client machine? Sometimes there are firewall rules around localhost which can lead to this error (see the quick check sketched below).
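For that last point, here is a purely illustrative sanity check (the port 32805 is taken from your traceback and will be different on every run, so this only makes sense while the cluster is still up):

```python
import socket

# Try to open a raw TCP connection to the scheduler port from the error
# message. If this also times out, the problem is at the networking level
# rather than in Dask itself.
try:
    sock = socket.create_connection(("127.0.0.1", 32805), timeout=5)
    sock.close()
    print("TCP connection to 127.0.0.1:32805 works")
except OSError as exc:
    print("TCP connection failed:", exc)
```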

Hi @verakye,

In most HPC clusters I have worked with, you can submit a job from a compute node, so you could still use dask-jobqueue or dask-mpi here. However, I know that some cluster configurations forbid this.

But anyway, using LocalCluster is probably a better option for testing purposes. And with 128 cores per node, you already have some power :grinning:!

@pavithraes just to clarify things considering your questions (@verakye correct me if I’m wrong): running an interactive session usually means you have SSH-ed to a compute node through a job reservation. So it is as if @verakye had their own server with 128 cores and a terminal on it. It’s therefore a single-server setup, and everything should be accessible.

From the error, it looks like the workers take too much time to connect to the scheduler. This is not the first post where I have seen this kind of question recently… A few things you can try:

  • Ensure your workers are using a directory that is fast enough or local to the server (/tmp for instance).
  • Try to increase the 30 s connection timeout using the dask configuration (see the sketch after this list).
  • Try with fewer than 128 workers at first.
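To make the first two points concrete, here is a rough sketch (the exact values are just examples; adjust the directory, timeout, and worker count to your node):

```python
import dask
from dask.distributed import Client, LocalCluster

# Give workers more time to connect to the scheduler (default is 30 s).
dask.config.set({"distributed.comm.timeouts.connect": "120s"})

cluster = LocalCluster(
    n_workers=32,                 # start smaller than 128 and scale up
    processes=True,
    threads_per_worker=1,
    local_directory="/tmp/dask",  # fast, node-local scratch space for the workers
)
client = Client(cluster)
```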