Advice for Slurm Configuration

Hello,

First off, apologies if I use confusing or incorrect terminology, I am still learning.

I am trying to set up the configuration for a Slurm-backed adaptive Dask cluster.

The supercomputer and its Slurm configuration are documented here. Here is some of the most relevant information extracted from the website:

Partition Name: compute
Max Nodes per Job: 512
Max Job Runtime: 8 hours
Max resources used simultaneously: no limit
Shared Node Usage: no
Default Memory per CPU: 1920 MB
Max Memory per CPU: 8000 MB

compute

This partition consists of 2659 AMD EPYC 7763 Milan compute nodes and is intended for running parallel scientific applications. The compute nodes allocated to a job are used exclusively and cannot be shared with other jobs. Some information about each compute node:

# of CPU Cores: 64
# of Threads: 128

Here is some output from scontrol show partition:

PartitionName=compute
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=512 MaxTime=08:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=l[10000-10058,10061-10062,10064-10065,10067-10068,10070-10083,10090-10095,10100-10158,10160-10183,10190-10195,10200-10258,10260-10283,10290-10295,10300-10357,10359-10383,10390-10395,10400-10483,10490-10495,10500-10583,10590-10595,10600-10683,10690-10695,10700-10783,10790-10795,20000-20059,20061-20062,20064-20065,20067-20068,20070-20083,20090-20095,20100-20183,20190-20195,20200-20223,20225-20283,20290-20295,20300-20383,20390-20395,20400-20483,20490-20495,20500-20583,20590-20595,20600-20683,20690-20695,30000-30059,30061-30062,30064-30083,30090-30095,30100-30183,30190-30195,30200-30230,30232-30283,30290-30295,30300-30383,30390-30395,30400-30483,30490-30495,30500-30583,30590-30595,30600-30683,30690-30695,30700-30760,30762-30783,30790-30795,40000-40026,40028-40029,40031-40032,40034-40035,40037-40038,40040-40083,40090-40095,40101-40102,40104-40105,40107-40108,40110-40111,40113-40183,40190-40195,40200-40283,40287-40295,40300-40359,40400-40483,40490-40495,40500-40583,40587-40595,40600-40683,40687-40695,50200-50259,50269-50271,50300-50359,50369-50371]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=711168 TotalNodes=2778 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=960 MaxMemPerCPU=3840

Here is what I have so far:

from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    name='dask-cluster',
    processes=32,
    cores=64,
    memory=f"{8000 * 64 * 0.90} MB",
    project="ab0995",
    queue="compute",
    interface='ib0',
    walltime='08:00:00',
    asynchronous=False,
    # job_extra=["--ntasks-per-node=50",],
)

Some things to mention:

  1. In the first table above, “nodes” refers to compute server nodes, not Dask nodes (which I think should rather be called Dask workers? If someone could clear up that terminology for me I would be grateful). Since I have 64 CPU cores and 8000 MB of allowed memory per CPU, I thought it would be sensible to set the memory to 8000 * 64 with a “reduction” factor of 0.90, just to be on the safe side (see the arithmetic sketch after this list).
  2. I have 64 CPUs, which I believe should translate to 64 “cores” in the SLURMCluster. I want each Python process to have 2 CPUs, so 32 processes in total. That might be optimised down to 4 CPUs per Python process, but I have no idea how to get a feeling for sensible settings here.
  3. I set the walltime of each dask-cluster job to the maximum allowed, as I would rather block with one Slurm job than have to queue again. This might leave the node idle some of the time, but it might still be more effective than waiting in the Slurm batch queue.
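
For reference, here is the arithmetic behind points 1 and 2 written out (just a sketch of my own reasoning; the variable names are only for illustration):

max_mem_per_cpu_mb = 8000    # "Max Memory per CPU" from the partition table
cores_per_node = 64          # CPU cores per compute node
safety_factor = 0.90         # stay a bit below the advertised maximum

# the value passed to memory= in the SLURMCluster call above
print(f"{max_mem_per_cpu_mb * cores_per_node * safety_factor} MB")  # -> 460800.0 MB per job

processes = 32               # worker processes per Slurm job
print(cores_per_node // processes)                                  # -> 2 CPUs per Python process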

If I now print the job script as configured above, I get:

print(cluster.job_script())

#!/usr/bin/env bash

#SBATCH -J dask-worker
#SBATCH -p compute
#SBATCH -A ab0995
#SBATCH -n 1
#SBATCH --cpus-per-task=64
#SBATCH --mem=430G
#SBATCH -t 08:00:00

/work/ab0995/AWIsoft/miniconda/NextGEMS/.conda/bin/python -m distributed.cli.dask_worker tcp://136.172.120.121:36449 --nthreads 2 --nprocs 32 --memory-limit 13.41GiB --name dummy-name --nanny --death-timeout 60 --interface ib0 --protocol tcp://

So, questions:

  1. By my mental math, 8000 * 64 * 0.9 = 460.8 GB, not 430G. What is happening here?
  2. I don’t really understand the nthreads, nprocs, and memory-limit settings that the dask_worker ends up with…?
  3. When I let the cluster scale adaptively, it requests one worker, which immediately exits without producing any logs (no slurm-out-??????? files appear).

If I actually use the client to do some calculations on a big dataset (several TB), I eventually run into memory errors (I have logs left over from some other configuration tests). Interestingly, I would have assumed the cluster would ask for another Slurm node, but that is not the case…

Adaptive scaling is set up simply by:

cluster.adapt(minimum=1, maximum=10)

What would some recommended settings be here? Any help or hints would be much appreciated!

One thing I noticed: if I get lucky and check the queue at the right moment, I can briefly see the Slurm warning “invalid” pop up, so clearly my current configuration is not being accepted by the system.

Hi @pgierz

Dask distinguishes between GB and GiB, so it’s a binary vs. decimal prefix thing. Just use GiB instead of GB. By the way, the 0.9 reduction should not be needed if you use correct memory values for the nodes you have.
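
For example, a quick sketch using Dask’s own helpers (the numbers come from your configuration and job script):

from dask.utils import parse_bytes, format_bytes

n = parse_bytes("460800 MB")   # your request: 8000 * 64 * 0.90 MB
print(n)                       # 460800000000 bytes, i.e. 460.8 GB (decimal)
print(format_bytes(n))         # '429.15 GiB' (binary), rounded up to the 430G in your job script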

nthreads = cores ÷ processes: 2 threads for each worker process.
nprocs = processes: the number of worker processes you asked for.
memory-limit = memory ÷ processes: the amount of memory each worker process gets.
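
With your numbers that works out to (again just a sketch):

from dask.utils import parse_bytes, format_bytes

cores, processes = 64, 32
memory = parse_bytes("460800 MB")

print(cores // processes)                # nthreads -> 2
print(processes)                         # nprocs -> 32
print(format_bytes(memory / processes))  # memory-limit -> '13.41 GiB', as in your dask_worker command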

In the dask-jobqueue interface you give the resources you want for each Slurm job, and these are translated into Dask terms on the worker command line.

To begin with, don’t use adapt; use scale instead. It is easier to play with, but I don’t think this is the problem. I don’t know the correct command line for Slurm, but you should be able to interrogate it and get some feedback on why your job failed or didn’t start. A first guess would be the memory you asked for: do your nodes have 430 GB of memory available?
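
For example, something like this (a sketch, reusing your cluster object from above):

from dask.distributed import Client

cluster.scale(jobs=2)     # explicitly ask for 2 Slurm jobs instead of adapting
client = Client(cluster)
# then watch the Slurm queue (e.g. squeue) to see whether the jobs actually start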

Best,
Guillaume.
