SLURMCluster on 64 nodes / Understanding Cluster scale method

Hi, I am trying to run a heavy task on 64 nodes, each with 32 cores. In particular, I want 64 Dask workers to perform my task, with each worker using 32 cores.

With my current setup I am always requesting 2048 workers, which I don’t want. I noticed this by setting breakpoints and monitoring len(cluster.scheduler.workers): it never reached more than ~1400 workers across several attempts with long wait times.

My goal is to have only 64 (Dask) workers and 2048 nannies, but I fail to establish such a setup. I guess that setup would then produce a single log file.

My current cluster and client are:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(memory='256g', processes=32, cores=32, queue='photon')
client = Client(cluster)
cluster.scale(jobs=64)

This produces multiple log files.
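If I read the dask-jobqueue behaviour correctly, this is where the 2048 comes from: each job script starts one worker process per "processes" on one node, so the total is jobs times processes. A sketch of the arithmetic only:

# Rough arithmetic behind my observation (assuming each job script
# starts `processes` worker processes on one node):
jobs = 64
processes = 32
total_workers = jobs * processes   # = 2048 workers, each with its own nanny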

I tried adding n_workers=64 to the SLURMCluster constructor, but it made no difference.

I partially solved the problem by calling cluster.scale(n=64) instead, but this led to extremely slow computation. In this case I had 64 workers, but also only 64 nannies (found by grepping “Start Nanny/workers” in the produced output logs). This produced a single log file.
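If I understand scale() correctly, n counts worker processes rather than jobs, so with processes=32 only a couple of job scripts are needed to reach 64 workers, which would mean only a couple of nodes actually run. Again just my sketch of the arithmetic, not verified against the scheduler:

import math

# scale(n=...) asks for n worker processes in total; dask-jobqueue then
# submits only enough jobs to cover that count.
n = 64
processes = 32
jobs_submitted = math.ceil(n / processes)   # = 2 jobs, i.e. only 2 of the 64 nodes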

Another observation, which might be obvious: I cannot specify more than 32 cores in the cluster constructor.

I hope I made my problem clear. My general goal is to scale distributed computation efficiently across 64 nodes (not fewer). To my understanding, getting the right number of workers and nannies is a step in the right direction. (I am confident that the dataframe I am using is capable of efficient distributed computation, and that the problems are not coming from there.)

A subquestion that is more important to me first: how can I start a job with 32 nannies and 1 worker?

The parameters that you give to the constructor define a single job, so it sounds like you want …

cluster = SLURMCluster(processes=1, cores=32, ...)
cluster.scale(jobs=64)
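Filled in with the memory and queue values from your first post (adjust those as needed), that would look something like:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# One job per node, one worker process per job, using all 32 cores as threads
cluster = SLURMCluster(memory='256g', processes=1, cores=32, queue='photon')
client = Client(cluster)

cluster.scale(jobs=64)        # 64 jobs -> 64 nodes -> 64 workers, 64 nannies
client.wait_for_workers(64)   # optionally block until all workers have connected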

To be really clear: this is impossible. A worker always has exactly one nanny.

To explain a little more: Dask is a bit different from Spark, just as Python is different from Java. Every process you launch is a worker; there is no such thing as a worker made of 32 processes.

The closest to what you want is what @mrocklin proposed, but there you’ll get 64 workers with 32 threads each. Depending on your computation, this might not be what you want, but I encourage you to try it!
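If 32 threads per worker turns out to be a problem (for example if your computation does not release the GIL), you can also split each node into a few smaller workers. Only an illustration, tune the split to your workload:

from dask_jobqueue import SLURMCluster

# Still one job per node, but 4 worker processes per job, each getting
# 32 / 4 = 8 threads and a quarter of the job's memory.
cluster = SLURMCluster(memory='256g', processes=4, cores=32, queue='photon')
cluster.scale(jobs=64)   # 64 nodes -> 64 * 4 = 256 workers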


Thank you both, @guillaumeeb and @mrocklin. I understand that my initial plan was not realistic.

But I still could not run on 64 nodes.

With this setup I had N job scripts, where N is the number of nodes.

I further added job arrays, so I now run on 32 nodes with a single job script (again using SLURMCluster). I am waiting for 64 free nodes to check whether the problem comes from the number of jobs or from something else (potentially internal).

I will report once I have results.