I am trying to run on a Slurm cluster where the admins recently changed the required GRES specification from --gres=gpu:1 to --gres=gpu:a100:1. If I configure my cluster with --gres=gpu:1, it creates workers and runs fine, but that form will no longer be supported by our admins. When I replace it with --gres=gpu:a100:1, creating workers fails with:
sbatch: error: Invalid GRES specification (with and without type identification)
As a sanity check, I created a script and ran sbatch with the new --gres=gpu:a100:1 requirement outside of Dask, and it runs just fine.
Any help would be much appreciated.
Here is my cluster configuration:

from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=2,                     # Number of cores per job
    memory="64GB",               # Memory per job
    queue="gpu",                 # Queue/partition name
    job_extra_directives=[
        "--gpus=1",              # Number of GPUs per job
        "--gres=gpu:1",          # Number of GPUs per job
    ],
    walltime="02:00:00",         # Job time limit
    local_directory="$TMPDIR",   # Temporary directory (optional)
    log_directory="logs",        # Directory for log files (optional)
)
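For debugging, it may help to look at the submission script dask-jobqueue generates before it submits anything; both GPU directives should appear as #SBATCH lines there:

# Print the sbatch script that dask-jobqueue will submit for each job,
# so the generated #SBATCH directives can be checked by eye.
print(cluster.job_script())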
I removed the extraneous --gpus=1 directive and it now seems to work. That would also explain why it worked with --gres=gpu:1 but not with --gres=gpu:a100:1: it seems Slurm rejects a job that requests GPUs both without a type (--gpus=1) and with one (--gres=gpu:a100:1), which matches the "with and without type identification" wording in the error. The --gpus=1 should never have been there. Thanks for the help.
Just to be clear, can you confirm that the following code works:
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=2,                     # Number of cores per job
    memory="64GB",               # Memory per job
    queue="gpu",                 # Queue/partition name
    job_extra_directives=[
        "--gres=gpu:a100:1",     # Request one A100 GPU per job
    ],
    walltime="02:00:00",         # Job time limit
    local_directory="$TMPDIR",   # Temporary directory (optional)
    log_directory="logs",        # Directory for log files (optional)
)
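Assuming it does, this is roughly how I plan to bring up workers afterwards (the scale count of 2 is just an example):

from dask.distributed import Client

cluster.scale(jobs=2)     # Submit two Slurm jobs, each becoming one Dask worker
client = Client(cluster)  # Connect a client to the cluster's scheduler
print(client)             # Workers appear once the Slurm jobs start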