Common workflow for using Dask on HPC systems

Hi! I’m new to Dask. I’m currently working on an HPC cluster managed by SLURM, with compute nodes (which execute the jobs) and a login node (which I access over SSH to submit SLURM jobs). I’m looking to define a workflow for distributing my tasks across CPUs on different nodes.
My first attempt consisted of a script that (see the sketch after this list):

  1. instantiated the SLURMCluster class with my custom setup.
  2. obtained a client using the get_client() method of the SLURMCluster object.
  3. used the map, submit, and gather methods of the Client object to distribute and manage tasks.
  4. closed the client and the cluster once all tasks were resolved.
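
For reference, here is a minimal sketch of that script. The queue name, resource requests, and the process function are placeholders standing in for my real setup:

```python
# Minimal sketch of the workflow described above; values are placeholders.
from dask_jobqueue import SLURMCluster


def process(x):
    # Placeholder for the real per-task work.
    return x * 2


if __name__ == "__main__":
    # 1. Start the SLURMCluster with a custom setup (placeholder resources).
    cluster = SLURMCluster(
        queue="normal",        # placeholder partition name
        cores=8,
        memory="16GB",
        walltime="01:00:00",
    )
    cluster.scale(jobs=4)      # request 4 SLURM worker jobs

    # 2. Get a client attached to this cluster.
    client = cluster.get_client()

    # 3. Distribute tasks and collect the results.
    futures = client.map(process, range(100))
    results = client.gather(futures)

    # 4. Shut everything down once all tasks are resolved.
    client.close()
    cluster.close()
```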

I then executed the script on the login node, following the procedure below (the SSH command I use is sketched after the list):

  1. I SSH into the login node using port forwarding (to access the Dask dashboard on port 8787 from my local machine).
  2. I run the Python script on the login node.
  3. While the script is running, I can access the Dask dashboard.
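
For reference, this is roughly the port-forwarding command from step 1 (the username and hostname are placeholders, and I assume the dashboard is on Dask’s default port 8787):

```bash
# Forward port 8787 on the login node (where the scheduler and dashboard
# run while the script executes) to localhost:8787 on my machine.
ssh -L 8787:localhost:8787 myuser@login.cluster.example
```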

This workflow completed my tasks successfully, but it has some limitations:

  • The Dask dashboard shuts down as soon as the Python script finishes.
  • I lose access to all dashboard information after the script completes.
  • This workflow forces me to keep a long-running process on the login node, which I’d like to avoid.
  • It also requires keeping my SSH session open, which is risky if my local machine shuts down or the connection is lost.

My question: Is there a more common or better approach for managing Dask tasks on an HPC system like this that avoids these issues?
For example, how can I keep the Dask scheduler and dashboard running independently of the script execution?

Thanks for any guidance!