Hi! I’m new to Dask. I’m currently working on an HPC cluster managed by SLURM, with compute nodes (which execute the jobs) and a login node (which I access through SSH to submit the SLURM jobs). I’m looking to define a workflow for distributing my tasks across CPUs on different nodes.
My first attempt consisted of a script (roughly as sketched below) that:
- initiated the `SLURMCluster` class with my custom setup
- obtained a client using the `get_client()` method of the `SLURMCluster` object
- used the `map`, `gather`, and `submit` methods of the `Client` object to distribute and manage tasks
- once all tasks were resolved, closed the client and the cluster.
Then, I executed the script on the login node following the procedure below:
- I SSH into the login node using port forwarding (to access the Dask dashboard on port 8787 from my local machine).
- I run the Python script on the login node.
- While the script is running, I can access the Dask dashboard.
This workflow completed my tasks successfully, but it has some limitations:
- The Dask dashboard shuts down as soon as the Python script finishes.
- I lose access to all dashboard information after the script completes.
- This workflow forces me to keep a long-running process on the login node, which I’d like to avoid.
- It also requires keeping my SSH session open, which is risky if my local machine shuts down or the connection is lost.
My question: Is there a more common or better approach for managing Dask tasks on an HPC system like this, while avoiding these issues?
For example, how can I keep the Dask scheduler and dashboard running independently of the script execution?
Thanks for any guidance!