Advice for Slurm Configuration

Hello,

First off, apologies if I use confusing or incorrect terminology, I am still learning.

I am trying to set up the configuration for a Slurm-backed adaptive Dask cluster.

The supercomputer and its Slurm configuration are documented here. Here is some of the most relevant information extracted from the website:

Partition Name: compute
Max Nodes per Job: 512
Max Job Runtime: 8 hours
Max resources used simultaneously: no limit
Shared Node Usage: no
Default Memory per CPU: 1920 MB
Max Memory per CPU: 8000 MB

compute

This partition consists of 2659 AMD EPYC 7763 Milan compute nodes and is intended for running parallel scientific applications. The compute nodes allocated to a job are used exclusively and cannot be shared with other jobs. Some information about each compute node:

# of CPU Cores: 64
# of Threads: 128

Here is some output from scontrol show partition:

PartitionName=compute
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=512 MaxTime=08:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=l[10000-10058,10061-10062,10064-10065,10067-10068,10070-10083,10090-10095,10100-10158,10160-10183,10190-10195,10200-10258,10260-10283,10290-10295,10300-10357,10359-10383,10390-10395,10400-10483,10490-10495,10500-10583,10590-10595,10600-10683,10690-10695,10700-10783,10790-10795,20000-20059,20061-20062,20064-20065,20067-20068,20070-20083,20090-20095,20100-20183,20190-20195,20200-20223,20225-20283,20290-20295,20300-20383,20390-20395,20400-20483,20490-20495,20500-20583,20590-20595,20600-20683,20690-20695,30000-30059,30061-30062,30064-30083,30090-30095,30100-30183,30190-30195,30200-30230,30232-30283,30290-30295,30300-30383,30390-30395,30400-30483,30490-30495,30500-30583,30590-30595,30600-30683,30690-30695,30700-30760,30762-30783,30790-30795,40000-40026,40028-40029,40031-40032,40034-40035,40037-40038,40040-40083,40090-40095,40101-40102,40104-40105,40107-40108,40110-40111,40113-40183,40190-40195,40200-40283,40287-40295,40300-40359,40400-40483,40490-40495,40500-40583,40587-40595,40600-40683,40687-40695,50200-50259,50269-50271,50300-50359,50369-50371]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=711168 TotalNodes=2778 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=960 MaxMemPerCPU=3840

Here is what I have so far:

from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    name='dask-cluster',
    processes=32,
    cores=64,
    memory=f"{8000 * 64 * 0.90} MB",
    project="ab0995",
    queue="compute",
    interface='ib0',
    walltime='08:00:00',
    asynchronous=False,
    # job_extra=["--ntasks-per-node=50",],
)

Some things to mention:

  1. In the first table above, “nodes” refers to compute server nodes, not Dask nodes (which I think should rather be called Dask workers? If someone could clear up that terminology for me I would be grateful). Since I have 64 CPU cores and 8000 MB of allowed memory per CPU, I thought it would be sensible to set the memory to 8000 * 64 with a “reduction” factor of 0.90, just to be on the safe side (see the arithmetic sketch after this list).
  2. I have 64 CPUs, which I believe should translate to 64 “cores” in the SLURMCluster. I want each Python process to have 2 CPUs, so 32 processes in total. That might be optimised down to 4 CPUs per Python process, but I have no idea how to get a feeling for sensible settings here.
  3. I set the walltime of each dask-cluster job to the maximum allowed, as I would rather block with one Slurm job than have to queue again. This might leave the node idle some of the time, but it might still be more effective than waiting in the Slurm batch queue.
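
For reference, here is the arithmetic behind points 1 and 2 written out (just a sketch of my own reasoning; the variable names are only for illustration):

max_mem_per_cpu_mb = 8000    # "Max Memory per CPU" from the partition table
cores_per_node = 64          # CPU cores per compute node
safety_factor = 0.90         # stay a bit below the advertised maximum

# the value passed to memory= in the SLURMCluster call above
print(f"{max_mem_per_cpu_mb * cores_per_node * safety_factor} MB")  # -> 460800.0 MB per job

processes = 32               # worker processes per Slurm job
print(cores_per_node // processes)                                  # -> 2 CPUs per Python process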

If I now print the job script as configured above, I get:

print(cluster.job_script())

#!/usr/bin/env bash

#SBATCH -J dask-worker
#SBATCH -p compute
#SBATCH -A ab0995
#SBATCH -n 1
#SBATCH --cpus-per-task=64
#SBATCH --mem=430G
#SBATCH -t 08:00:00

/work/ab0995/AWIsoft/miniconda/NextGEMS/.conda/bin/python -m distributed.cli.dask_worker tcp://136.172.120.121:36449 --nthreads 2 --nprocs 32 --memory-limit 13.41GiB --name dummy-name --nanny --death-timeout 60 --interface ib0 --protocol tcp://

So, questions:

  1. By my mental math, 8000 * 64 * 0.9 = 460.8 GB, not 430G. What is happening here?
  2. I don’t really understand the nthreads, nprocs, and memory-limit settings that the dask_worker ends up with…?
  3. When I let the cluster scale adaptively, it requests one worker, which immediately exits without producing any logs (no slurm-out-??????? files appear).

If I actually use the client to do some calculations on a big dataset (several TB), I eventually run into memory errors (I have logs left over from some other configuration tests). Interestingly, I would have assumed the cluster would ask for another Slurm node, but that is not the case…

Adaptive scaling is set up simply by:

cluster.adapt(minimum=1, maximum=10)

What would some recommended settings be here? Any help or hints would be much appreciated!

One thing I noticed: if I get lucky and check the queue at the right moment, I can briefly see the Slurm warning “invalid” pop up, so clearly my current configuration is not being accepted by the system.

Hi @pgierz

Dask distinguishes between GB and GiB, so it’s a binary vs. decimal prefix thing. Just use GiB instead of GB. By the way, the 0.9 reduction should not be needed if you use correct memory values for the nodes you have.
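
For example, a quick sketch using Dask’s own helpers (the numbers come from your configuration and job script):

from dask.utils import parse_bytes, format_bytes

n = parse_bytes("460800 MB")   # your request: 8000 * 64 * 0.90 MB
print(n)                       # 460800000000 bytes, i.e. 460.8 GB (decimal)
print(format_bytes(n))         # '429.15 GiB' (binary), rounded up to the 430G in your job script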

nthreads = cores ÷ processes: 2 threads for each worker process.
nprocs = processes: the number of worker processes you asked for.
memory-limit = memory ÷ processes: the amount of memory each worker process gets.
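
With your numbers that works out to (again just a sketch):

from dask.utils import parse_bytes, format_bytes

cores, processes = 64, 32
memory = parse_bytes("460800 MB")

print(cores // processes)                # nthreads -> 2
print(processes)                         # nprocs -> 32
print(format_bytes(memory / processes))  # memory-limit -> '13.41 GiB', as in your dask_worker command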

In the dask-jobqueue interface you give the resources you want for each Slurm job, and these are translated into Dask terms on the worker command line.

To begin with, don’t use adapt; use scale instead. It is easier to play with, but I don’t think this is the problem. I don’t know the correct command line for Slurm, but you should be able to interrogate it and get some feedback on why your job failed or didn’t start. A first guess would be the memory you asked for: do your nodes have 430 GB of memory available?
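
For example, something like this (a sketch, reusing your cluster object from above):

from dask.distributed import Client

cluster.scale(jobs=2)     # explicitly ask for 2 Slurm jobs instead of adapting
client = Client(cluster)
# then watch the Slurm queue (e.g. squeue) to see whether the jobs actually start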

Best,
Guillaume.
