Hi, I am trying to run a heavy task on 64 nodes, each with 32 cores. In particular, I want 64 dask workers to perform my task, each worker using 32 cores.
With my current setup I am always requesting 2048 workers, which I don’t want. I realized this after adding breakpoints and monitoring len(cluster.scheduler.workers): it never reached more than ~1400 workers across several attempts, even with long wait times.
My goal is to have only 64 (dask) workers and 2048 nannies, but I have failed to establish such a setup. I guess that would then produce a single log file.
My current cluster and client are:
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(memory='256g', processes=32, cores=32, queue='photon')
client = Client(cluster)
cluster.scale(jobs=64)  # 64 jobs x 32 processes per job = 2048 worker processes
This produces multiple log files.
I tried adding n_workers=64 to the SLURMCluster constructor, but it made no difference.
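For reference, that attempt looked roughly like this (just a sketch of what I passed, with the same keywords as above):

from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# Same setup as before, but also asking the constructor for 64 workers up front.
cluster = SLURMCluster(memory='256g', processes=32, cores=32, queue='photon',
                       n_workers=64)
client = Client(cluster)
cluster.scale(jobs=64)  # behaved exactly as before: 2048 workers requested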
I partially solved the problem by passing cluster.scale(n=64) instead, but this led to extremely slow computation. In this case I had 64 workers, but also only 64 nannies (found by grepping “Start Nanny/workers” in the produced output logs), and it produced a single log file.
Another observation, which might be obvious: I cannot specify more than 32 cores in the cluster constructor.
I hope I have made my problem clear. My general goal is to efficiently scale distributed computation across 64 nodes (not fewer). To my understanding, getting the right number of workers and nannies is a step in the right direction. (I am confident that the data frame I am using is capable of efficient distributed computation, and that the problems are not coming from there.)
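In case it helps, here is roughly what I imagine the target setup should look like, based on my reading of the dask-jobqueue docs (processes being the number of worker processes per job). This is only a guess on my side, so please correct me if it is not the right approach:

from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# One worker process per job, using all 32 cores of a node as threads.
cluster = SLURMCluster(memory='256g', processes=1, cores=32, queue='photon')
client = Client(cluster)
cluster.scale(jobs=64)  # 64 jobs -> 64 workers, one per node

I am not sure, though, whether a single worker with 32 threads per node would actually keep all the cores busy for my workload.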