Some GPUs are idling when running dask-mpi on HPC

I would like to use dask-mpi on an HPC system. When I use distributed.Worker, everything is fine, but I run into problems with dask_cuda.CUDAWorker. In dask-mpi, rank 0 is the scheduler, rank 1 is the client process, and the remaining ranks are workers. The HPC system has 4 GPUs per node, and when I run mpiexec -np 3 python test.py on ONE node, rank 2 gets 4 GPU workers. That means each worker rank drives all 4 GPUs of its node.
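
For reference, test.py is roughly the following minimal dask-mpi script (simplified sketch, only the parts relevant to the worker setup; the ucx protocol matches the ucx:// client address shown further down):

```python
# test.py -- simplified sketch of the script launched with mpiexec
from dask_mpi import initialize
from dask.distributed import Client

# rank 0 becomes the scheduler, rank 1 runs this client code,
# and every remaining rank starts a dask_cuda.CUDAWorker,
# which spawns one worker per visible GPU on that rank's node
initialize(worker_class="dask_cuda.CUDAWorker", protocol="ucx")

client = Client()  # connects to the scheduler started by initialize()
print(client)

# ... actual GPU computation goes here ...
```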
When using multiple nodes, we must specify -npernode, otherwise all the processes run on the same node. Here is the problem. For example, I want 12 GPU workers running on 3 nodes, so ranks 2, 3, and 4 should each have 4 GPU workers; that means 5 processes in total. As mentioned above, -npernode is required: if I only run mpiexec -hostfile ${hostfile} -np 5 python test.py, all 12 workers end up on the same node. In this case, print(client) shows <Client: 'ucx://10.110.0.4:44489' processes=3 threads=3, memory=330.00 GiB>.
If I run mpiexec -hostfile ${hostfile} -np 5 -npernode 2 python test.py, rank 0 and rank 1 are on node1, rank 2 and rank 3 are on node2, and rank 4 is on node3. So node1 has no GPU worker and all its GPUs are idle, node2 has 8 GPU workers, and node3 has 4 GPU workers. In this case, the client is <Client: 'ucx://10.110.0.4:48231' processes=0 threads=0, memory=0 B>.
If I run mpiexec -hostfile ${hostfile} -np 3 -npernode 1 python test.py, only node3 has 4 GPU workers, and the GPUs on node1 and node2 are idle.
The MPI on the HPC system is Open MPI 4.0.3.
Is there any way or strategy to solve this problem? Keeping GPUs idle is a waste of computational resources.

Hi @lx_dask_forum, welcome back!

I understand your problem. Unfortunately, there is no easy solution built into dask-mpi.

Some solutions that I can think of:

  • Submit two different MPI jobs: one with the Scheduler and Client only, typically on a node without a GPU, and another MPI job with Workers only. The second part is explained here. However, your two jobs will be disconnected and might not start at the same time.
  • Start a Scheduler on an interactive node, again use the dask-mpi CLI with the --no-scheduler option to connect to it, and also run your Client script from the interactive node.

Maybe @kmpaul has some other thought?

Thank you. I tried to use wrapper scripts for the scheduler, the workers, and the Dask program, but it failed.

job_script.sh :

#load necessary modules
.....

#wrapper scripts for the scheduler, workers, and program
scheduler=scheduler.sh
workers=worker.sh
program=program.sh

#use xargs to run the 3 scripts in parallel
ls $workers $scheduler $program | xargs -P3 -n1 bash

wrapper of the scheduler, scheduler.sh:

#!/bin/bash
#launch conda env
source ${CONDA_PREFIX}/etc/profile.d/conda.sh
source activate rapids-23.02

#launch dask scheduler
SCHEDULER=./scheduler.json
dask-mpi  --scheduler-file $SCHEDULER  --no-nanny

wrapper of the workers, worker.sh:

#!/bin/bash
#sleep 3s, then execute runworker.sh on 4 nodes
sleep 3
mpiexec ${NQSII_MPIOPTS} -np 4 -npernode 1 -x PATH -x LD_LIBRARY_PATH ./runworker.sh

worker launch script runworker.sh:

#!/bin/bash
#launch conda env on 4 nodes
source ${CONDA_PREFIX}/etc/profile.d/conda.sh
source activate rapids-23.02

#launch worker
SCHEDULER=./scheduler.json
dask-mpi --scheduler-file $SCHEDULER --no-scheduler --no-nanny

The log is below. It seems that the workers did not launch.

log
2023-04-20 22:37:33,748 - distributed.scheduler - INFO - State start
/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/distributed/utils.py:166: RuntimeWarning: Couldn't detect a suitable IP address for reaching '8.8.8.8', defaulting to hostname: [Errno 101] Network is unreachable
  warnings.warn(
2023-04-20 22:37:33,768 - distributed.scheduler - INFO -   Scheduler at:    tcp://10.120.0.5:41441
2023-04-20 22:37:33,768 - distributed.scheduler - INFO -   dashboard at:                     :8787
Traceback (most recent call last):
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/bin/dask-mpi", line 10, in <module>
Traceback (most recent call last):
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/bin/dask-mpi", line 10, in <module>
Traceback (most recent call last):
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/bin/dask-mpi", line 10, in <module>
Traceback (most recent call last):
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/bin/dask-mpi", line 10, in <module>
    sys.exit(main())
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    sys.exit(main())
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    sys.exit(main())
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    sys.exit(main())
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/click/core.py", line 1055, in main
    return self.main(*args, **kwargs)
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/click/core.py", line 1055, in main
    return self.main(*args, **kwargs)
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/click/core.py", line 1055, in main
    return self.main(*args, **kwargs)
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    rv = self.invoke(ctx)
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    rv = self.invoke(ctx)
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    rv = self.invoke(ctx)
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/dask_mpi/cli.py", line 147, in main
    return __callback(*args, **kwargs)
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/dask_mpi/cli.py", line 147, in main
    return __callback(*args, **kwargs)
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/dask_mpi/cli.py", line 147, in main
    return __callback(*args, **kwargs)
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/dask_mpi/cli.py", line 147, in main
    asyncio.get_event_loop().run_until_complete(run_worker())
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    asyncio.get_event_loop().run_until_complete(run_worker())
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    asyncio.get_event_loop().run_until_complete(run_worker())
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    asyncio.get_event_loop().run_until_complete(run_worker())
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/dask_mpi/cli.py", line 128, in run_worker
    return future.result()
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/dask_mpi/cli.py", line 128, in run_worker
    return future.result()
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/dask_mpi/cli.py", line 128, in run_worker
    return future.result()
  File "/home/TRG/.conda/envs/rapids-23.02-openmpi/lib/python3.10/site-packages/dask_mpi/cli.py", line 128, in run_worker
    raise DeprecationWarning(
DeprecationWarning: Option --no-nanny is deprectaed, use --worker-class instead
    raise DeprecationWarning(
DeprecationWarning: Option --no-nanny is deprectaed, use --worker-class instead
    raise DeprecationWarning(
DeprecationWarning: Option --no-nanny is deprectaed, use --worker-class instead
    raise DeprecationWarning(
DeprecationWarning: Option --no-nanny is deprectaed, use --worker-class instead
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[10124,1],3]
  Exit code:    1
--------------------------------------------------------------------------
2023-04-20 22:37:43,629 - distributed.scheduler - INFO - Receive client connection: Client-870b7719-df80-11ed-8ec7-3cecefd36ada
2023-04-20 22:37:43,948 - distributed.core - INFO - Starting established connection to tcp://10.120.0.5:55982
2023-04-20 22:37:53,963 - distributed.scheduler - INFO - Remove client Client-870b7719-df80-11ed-8ec7-3cecefd36ada
2023-04-20 22:37:53,963 - distributed.core - INFO - Received 'close-stream' from tcp://10.120.0.5:55982; closing.
2023-04-20 22:37:53,963 - distributed.scheduler - INFO - Remove client Client-870b7719-df80-11ed-8ec7-3cecefd36ada
2023-04-20 22:37:53,963 - distributed.scheduler - INFO - Close client connection: Client-870b7719-df80-11ed-8ec7-3cecefd36ada
start
client: <Client: 'tcp://10.120.0.5:41441' processes=0 threads=0, memory=0 B>

%NQSV(INFO): Batch job received signal SIGKILL. (Exceeded per-req elapse time limit)

============================================================
Request ID:             30705.nqsv
Request Name:           dasktest
Queue:                  gpu@nqsv
Number of Jobs:         4
User Name:              luoxiao
Group Name:             TRG
Created Request Time:   Thu Apr 20 22:36:41 2023
Started Request Time:   Thu Apr 20 22:37:32 2023
Ended Request Time:     Thu Apr 20 22:39:00 2023
Resources Information:
  Elapse:               69S
  Remaining Elapse:     0S
============================================================

%NQSV(INFO): ------- Output from job:0001 -------
Loading openmpi/4.1.5/gcc9.4.0-cuda11.8.0
  Loading requirement: cuda/11.8.0 ucx/1.13.1/cuda11.8.0

%NQSV(INFO): ------- Output from job:0002 -------
Loading openmpi/4.1.5/gcc9.4.0-cuda11.8.0
  Loading requirement: cuda/11.8.0 ucx/1.13.1/cuda11.8.0

%NQSV(INFO): ------- Output from job:0003 -------
Loading openmpi/4.1.5/gcc9.4.0-cuda11.8.0
  Loading requirement: cuda/11.8.0 ucx/1.13.1/cuda11.8.0

I also tried to run mpirun -np 4 dask-mpi --no-nanny --scheduler-file scheduler.json on the interactive node, but it failed as above.

I’m not sure what you are doing. Did you submit one HPC job or 3 different ones?

I was more or less suggesting to submit several jobs. Looking back at the dask-mpi documentation, I don’t think my first solution will work. Something that should work (see this link) would be to:

  1. Submit a job that starts a scheduler and writes its information to a JSON file. You might also run this on a head node, but that might not work on some HPC clusters due to network configuration. So inside the job script:
dask-scheduler --scheduler-file /path/to/scheduler.json
  2. Submit an MPI job on your GPU partition, using dask-mpi:
mpirun -np 4 dask-mpi --worker-class dask_cuda.CUDAWorker --no-nanny --scheduler-file /path/to/scheduler.json --no-scheduler
  3. Submit a job to run your Dask code, or just launch it from a head node (see the sketch below for reading the scheduler file):
python test.py --scheduler-file /path/to/scheduler.json  # You'll have to handle your code reading the scheduler file
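
For step 3, here is a minimal sketch of how test.py could read the scheduler file; the --scheduler-file argument handling shown is just one possible way to do it, adapt it to your script:

```python
# test.py -- sketch of a client script connecting through the scheduler file
import argparse

from dask.distributed import Client

parser = argparse.ArgumentParser()
parser.add_argument("--scheduler-file", default="/path/to/scheduler.json")
args = parser.parse_args()

# Connect to the scheduler that dask-scheduler / dask-mpi already started
# and wrote into the JSON file
client = Client(scheduler_file=args.scheduler_file)
print(client)

# ... submit your Dask/GPU work here ...

client.close()
```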

Thank you. What I did was just submit the job script “job_script.sh” to the HPC system, where the job uses 4 nodes. Then the command ls $workers $scheduler $program | xargs -P3 -n1 bash executes the following commands

dask-mpi --scheduler-file $SCHEDULER
mpiexec ${NQSII_MPIOPTS} -np 4 -npernode 1 -x PATH -x LD_LIBRARY_PATH \
        dask-mpi --scheduler-file $SCHEDULER --no-scheduler --no-nanny
python test.py

at the same time. But this didn’t work.
The page Dask-MPI with Interactive Jobs — Dask-MPI 2022.4.0+25.g9b6951b.dirty documentation says that the command mpirun -np 4 dask-mpi --scheduler-file scheduler.json will create a scheduler on rank 0 and Dask workers (or nannies) on all remaining MPI ranks. But it doesn’t work on the interactive node of our HPC system.
All the compute nodes in the HPC system have GPUs; only the login nodes are GPU-less.

It seems the problem can be solved by adding --map-by node after -np process_num, e.g. mpiexec -hostfile ${myhostfile} -np 10 --map-by node. This option makes mpiexec place the processes round-robin across nodes. For example, when I run the above command on 8 nodes with --map-by node, rank 0 is the scheduler on node1, rank 1 is the client on node2, and workers 1 to 8 are launched on nodes 3 to 8 and then on nodes 1 and 2. With this option, I can use all the GPUs on all nodes.


Yep, that probably won’t work.

Yes, that would only work inside a job submission: one job submission per MPI command.

Well, this is great!! It seems you found a workaround to your problem?

Yes. Just let the processes be mapped round-robin across nodes; then the last two worker ranks end up on the same nodes as the scheduler and the client, respectively. The Open MPI option is --map-by node. I’m not sure whether other MPI implementations have the same option, but I think they provide an equivalent feature.
