Common workflow for using Dask on HPC systems

Hi! I’m new to Dask. I’m currently working on an HPC system managed by SLURM, with compute nodes (those that execute the jobs) and a login node (which I access through SSH to submit the SLURM jobs). I’m trying to define a workflow for distributing my tasks across CPUs on different nodes.
My first attempt consisted of a script that (see the sketch after this list):

  1. instantiated the SLURMCluster class with my custom setup.
  2. obtained a client using the get_client() method of the SLURMCluster object.
  3. used the map, gather, and submit methods of the Client object to distribute and manage tasks.
  4. closed the client and cluster once all tasks were resolved.
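
Roughly, the script looked like the sketch below (the cluster options are only placeholders for my actual setup):

```python
from dask_jobqueue import SLURMCluster


def double(x):
    return 2 * x


if __name__ == "__main__":
    # 1. Initialize the SLURMCluster with a custom setup (placeholder values)
    cluster = SLURMCluster(cores=8, memory="16GB", walltime="01:00:00")
    cluster.scale(jobs=4)  # ask SLURM for 4 worker jobs

    # 2. Obtain a client from the cluster
    client = cluster.get_client()

    # 3. Distribute and manage tasks with map / submit / gather
    futures = client.map(double, range(100))
    extra = client.submit(sum, range(10))
    results = client.gather(futures + [extra])

    # 4. Close the client and cluster once everything is resolved
    client.close()
    cluster.close()
```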

Then, I executed the script on the login node following the procedure below:

  1. I SSH into the login node using port forwarding (to access the Dask dashboard on port 8787 from my local machine).
  2. I run the Python script on the login node.
  3. While the script is running, I can access the Dask dashboard.

This workflow completed my tasks successfully, but it has some limitations:

  • The Dask dashboard shuts down as soon as the Python script finishes.
  • I lose access to all dashboard information after the script completes.
  • This workflow forces me to keep a long-running process on the login node, which I’d like to avoid.
  • It also requires keeping my SSH session open, which is risky if my local machine shuts down or the connection is lost.

My question: Is there a more common or better approach for managing Dask tasks on an HPC system like this, while avoiding these issues?
For example, how can I keep the Dask scheduler and dashboard running independently of the script execution?

Thanks for any guidance!

Hi @joseph-pq, welcome to Dask community!

Have you taken a look at the performance_report context manager? It allows you to save the Dashboard information as an HTML file.
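
For example, a minimal sketch (the cluster options are placeholders for your own setup):

```python
from dask.distributed import Client, performance_report
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=8, memory="16GB")  # placeholder options
cluster.scale(jobs=4)
client = Client(cluster)

# Everything executed inside this block is recorded and written to the HTML
# file, so you can inspect the dashboard information after the script finishes.
with performance_report(filename="dask-report.html"):
    futures = client.map(lambda x: x + 1, range(100))
    results = client.gather(futures)

client.close()
cluster.close()
```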

Another solution would be to have one script for starting the Dask cluster, and another script to submit jobs to it. You can save the Scheduler information into a JSON file to share it with the other process.
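
Something along these lines, assuming a shared file system and placeholder cluster options:

```python
# start_cluster.py -- starts the cluster, writes the Scheduler info to a JSON
# file, and keeps running so the Scheduler and dashboard stay up.
import time

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=8, memory="16GB")  # placeholder options
cluster.scale(jobs=4)
client = Client(cluster)
client.write_scheduler_file("scheduler.json")

while True:
    time.sleep(60)
```

```python
# submit_tasks.py -- connects to the running cluster through the shared
# scheduler file and submits work to it.
from dask.distributed import Client

client = Client(scheduler_file="scheduler.json")
futures = client.map(lambda x: x ** 2, range(100))
print(client.gather(futures))
client.close()
```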

A common workflow is to submit a “master” job, which will create the Scheduler and Client and submit other jobs for Workers through SLURMCluster. You can also take a look at dask-mpi.
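
A hedged sketch of such a master script (the sbatch invocation and all cluster options are only illustrative):

```python
# my_dask_master.py -- submit it as its own SLURM job, e.g.
#   sbatch --wrap "python my_dask_master.py"
# It then creates the Scheduler and Client on a compute node and requests
# additional SLURM jobs for the Workers through SLURMCluster.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster


def process(x):
    return x * x


if __name__ == "__main__":
    cluster = SLURMCluster(cores=8, memory="16GB", walltime="01:00:00")  # placeholders
    cluster.scale(jobs=4)  # each worker job is a separate SLURM allocation
    client = Client(cluster)

    results = client.gather(client.map(process, range(1000)))
    print(sum(results))

    client.close()
    cluster.close()
```

This way the long-running process lives on a compute node rather than the login node, and it doesn’t depend on your SSH session staying open.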

You can always create a script that runs the cluster from the terminal, with all of its capabilities, together with a function to connect a client to it.