Common workflow for using Dask on HPC systems

Hi! I’m new to Dask. I’m currently working on an HPC system managed by SLURM, with compute nodes (those that execute the jobs) and a login node (which I access through SSH to submit the SLURM jobs). I’m trying to define a workflow for distributing my tasks across CPUs on different nodes.
My first attempt consisted of a script that (see the sketch after this list):

  1. instantiated the SLURMCluster class with my custom setup.
  2. obtained a client using the get_client() method of the SLURMCluster object.
  3. used the map, gather, and submit methods of the Client object to distribute and manage tasks.
  4. closed the client and cluster once all tasks were resolved.
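
Roughly, the script looked like the sketch below (the cluster options are only placeholders for my actual setup):

```python
from dask_jobqueue import SLURMCluster


def double(x):
    return 2 * x


if __name__ == "__main__":
    # 1. Initialize the SLURMCluster with a custom setup (placeholder values)
    cluster = SLURMCluster(cores=8, memory="16GB", walltime="01:00:00")
    cluster.scale(jobs=4)  # ask SLURM for 4 worker jobs

    # 2. Obtain a client from the cluster
    client = cluster.get_client()

    # 3. Distribute and manage tasks with map / submit / gather
    futures = client.map(double, range(100))
    extra = client.submit(sum, range(10))
    results = client.gather(futures + [extra])

    # 4. Close the client and cluster once everything is resolved
    client.close()
    cluster.close()
```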

Then, I executed the script on the login node following the procedure below:

  1. I SSH into the login node using port forwarding (to access the Dask dashboard on port 8787 from my local machine).
  2. I run the Python script on the login node.
  3. While the script is running, I can access the Dask dashboard.

This workflow completed my tasks successfully, but it has some limitations:

  • The Dask dashboard shuts down as soon as the Python script finishes.
  • I lose access to all dashboard information after the script completes.
  • This workflow forces me to keep a long-running process on the login node, which I’d like to avoid.
  • It also requires keeping my SSH session open, which is risky if my local machine shuts down or the connection is lost.

My question: Is there a more common or better approach for managing Dask tasks on an HPC system like this, while avoiding these issues?
For example, how can I keep the Dask scheduler and dashboard running independently of the script execution?

Thanks for any guidance!

Hi @joseph-pq, welcome to Dask community!

Have you taken a look at the performance_report context manager? It allows you to save the Dashboard information as an HTML file.
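
For example, a minimal sketch (the cluster options are placeholders for your own setup):

```python
from dask.distributed import Client, performance_report
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=8, memory="16GB")  # placeholder options
cluster.scale(jobs=4)
client = Client(cluster)

# Everything executed inside this block is recorded and written to the HTML
# file, so you can inspect the dashboard information after the script finishes.
with performance_report(filename="dask-report.html"):
    futures = client.map(lambda x: x + 1, range(100))
    results = client.gather(futures)

client.close()
cluster.close()
```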

Another solution would be to have one script for starting the Dask cluster, and another script to submit jobs to it. You can save the Scheduler information into a JSON file to share it with the other process.
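
Something along these lines, assuming a shared file system and placeholder cluster options:

```python
# start_cluster.py -- starts the cluster, writes the Scheduler info to a JSON
# file, and keeps running so the Scheduler and dashboard stay up.
import time

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=8, memory="16GB")  # placeholder options
cluster.scale(jobs=4)
client = Client(cluster)
client.write_scheduler_file("scheduler.json")

while True:
    time.sleep(60)
```

```python
# submit_tasks.py -- connects to the running cluster through the shared
# scheduler file and submits work to it.
from dask.distributed import Client

client = Client(scheduler_file="scheduler.json")
futures = client.map(lambda x: x ** 2, range(100))
print(client.gather(futures))
client.close()
```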

A common workflow is to submit a “master” job, which will create the Scheduler and Client and submit other jobs for Workers through SLURMCluster. You can also take a look at dask-mpi.
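
A hedged sketch of such a master script (the sbatch invocation and all cluster options are only illustrative):

```python
# my_dask_master.py -- submit it as its own SLURM job, e.g.
#   sbatch --wrap "python my_dask_master.py"
# It then creates the Scheduler and Client on a compute node and requests
# additional SLURM jobs for the Workers through SLURMCluster.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster


def process(x):
    return x * x


if __name__ == "__main__":
    cluster = SLURMCluster(cores=8, memory="16GB", walltime="01:00:00")  # placeholders
    cluster.scale(jobs=4)  # each worker job is a separate SLURM allocation
    client = Client(cluster)

    results = client.gather(client.map(process, range(1000)))
    print(sum(results))

    client.close()
    cluster.close()
```

This way the long-running process lives on a compute node rather than the login node, and it doesn’t depend on your SSH session staying open.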

You can always create a script that runs the cluster from the terminal, with all of its capabilities, together with a function to connect a client to it.