Slurm cluster gres argument issue

I am trying to run on a Slurm cluster that recently changed --gres=gpu:1 to --gres=gpu:a100:1. If I configure my cluster with --gres=gpu:1, it creates workers and runs fine, but this will no longer be supported by our admins. When I replace it with --gres=gpu:100:1, creating workers fails with:
sbatch: error: Invalid GRES specification (with and without type identification)

I have created a script and run sbatch with the new --gres=gpu:a100:1 requirement outside of dask, and it also runs just fine.

Any help would be much appreciated.

cluster = SLURMCluster(
    cores=2,                      # Number of cores per job
    memory="64GB",                # Memory per job
    queue="gpu",                  # Queue/partition name
    job_extra_directives=[
        '--gpus=1',               # Number of GPUs per job
        '--gres=gpu:1',           # Number of GPUs per job
    ],
    walltime="02:00:00",          # Job time limit
    local_directory="$TMPDIR",    # Temporary directory (optional)
    log_directory="logs",         # Directory for log files (optional)
)

Sean

Hi @samckinn, welcome to Dask Discourse!

Just to be sure:

Is the typo only in this Discourse post, or did you also forget the letter 'a' (gpu:100:1 instead of gpu:a100:1) when using dask-jobqueue?

Do you really need both of these directives (--gpus=1 and --gres=gpu:1)?

To make your debugging easier, you should print the job script generated by dask-jobqueue (see the How to debug page in the Dask-jobqueue documentation). Then write it to a file and try submitting it with sbatch yourself.
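Something like this, as a minimal sketch using your configuration from above (job_script() is the dask-jobqueue method that returns the generated submission script; the file name here is just an example):

from dask_jobqueue import SLURMCluster

# Same configuration as in your post (adjust to your site).
cluster = SLURMCluster(
    cores=2,
    memory="64GB",
    queue="gpu",
    job_extra_directives=["--gres=gpu:a100:1"],
    walltime="02:00:00",
)

# Inspect the #SBATCH directives dask-jobqueue would submit.
print(cluster.job_script())

# Write the generated script to a file so it can be submitted
# manually, reproducing any sbatch error outside of dask-jobqueue.
with open("test_job.sh", "w") as f:
    f.write(cluster.job_script())
# then, from a shell:  sbatch test_job.sh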

Sorry, that was a Discourse typo.

I removed the extraneous --gpus=1 argument and it now seems to work. Not sure why it works with --gres=gpu:1 but not with --gres=gpu:a100:1, but thanks for the help. The --gpus=1 should never have been there.

Sean

Just to be clear, can you confirm that the following code works:

cluster = SLURMCluster(
    cores=2,                      # Number of cores per job
    memory="64GB",                # Memory per job
    queue="gpu",                  # Queue/partition name
    job_extra_directives=[
        '--gres=gpu:a100:1',      # Number of GPUs per job
    ],
    walltime="02:00:00",          # Job time limit
    local_directory="$TMPDIR",    # Temporary directory (optional)
    log_directory="logs",         # Directory for log files (optional)
)

Yep, it’s all good now. Thanks for the info on how to troubleshoot in the future.

Sean
