Slurm cluster gres argument issue

I am trying to run on a Slurm cluster that recently changed --gres=gpu:1 to --gres=gpu:a100:1. If I configure my cluster with --gres=gpu:1, it creates workers and runs fine, but this will no longer be supported by our admins. When I replace it with --gres=gpu:100:1, creating workers fails with:
sbatch: error: Invalid GRES specification (with and without type identification)

I have created a script and run sbatch with the new --gres=gpu:a100:1 requirement outside of dask, and it also runs just fine.

Any help would be much appreciated.

cluster = SLURMCluster(
    cores=2,                      # Number of cores per job
    memory="64GB",                # Memory per job
    queue="gpu",                  # Queue/partition name
    job_extra_directives=[
        '--gpus=1',               # Number of GPUs per job
        '--gres=gpu:1',           # Number of GPUs per job
    ],
    walltime="02:00:00",          # Job time limit
    local_directory="$TMPDIR",    # Temporary directory (optional)
    log_directory="logs",         # Directory for log files (optional)
)

Sean

Hi @samckinn, welcome to Dask Discourse!

Just to be sure:

Is the typo only in this Discourse post, or did you also forget the letter 'a' (gpu:100:1 instead of gpu:a100:1) when using dask-jobqueue?

Do you really need both of these directives (--gpus=1 and --gres=gpu:1)?

To make your debugging easier, you should print the job script generated by dask-jobqueue (see the How to debug page in the Dask-jobqueue documentation). Then write it to a file and try submitting it with sbatch yourself.
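Something like this, as a minimal sketch using your configuration from above (job_script() is the dask-jobqueue method that returns the generated submission script; the file name here is just an example):

from dask_jobqueue import SLURMCluster

# Same configuration as in your post (adjust to your site).
cluster = SLURMCluster(
    cores=2,
    memory="64GB",
    queue="gpu",
    job_extra_directives=["--gres=gpu:a100:1"],
    walltime="02:00:00",
)

# Inspect the #SBATCH directives dask-jobqueue would submit.
print(cluster.job_script())

# Write the generated script to a file so it can be submitted
# manually, reproducing any sbatch error outside of dask-jobqueue.
with open("test_job.sh", "w") as f:
    f.write(cluster.job_script())
# then, from a shell:  sbatch test_job.sh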

Sorry, that was a Discourse typo.

I removed the extraneous --gpus=1 argument and it now seems to work. Not sure why it works with --gres=gpu:1 but not with --gres=gpu:a100:1, but thanks for the help. The --gpus=1 should never have been there.

Sean

Just to be clear, can you confirm that the following code works:

cluster = SLURMCluster(
    cores=2,                      # Number of cores per job
    memory="64GB",                # Memory per job
    queue="gpu",                  # Queue/partition name
    job_extra_directives=[
        '--gres=gpu:a100:1',      # Number of GPUs per job
    ],
    walltime="02:00:00",          # Job time limit
    local_directory="$TMPDIR",    # Temporary directory (optional)
    log_directory="logs",         # Directory for log files (optional)
)

Yep, it’s all good now. Thanks for the info on how to troubleshoot in the future.

Sean
