Dask deployment on SLURM Cluster with GPUs

I am trying to set up XGBoost with Dask on a SLURM cluster, and I was able to get it working using only CPUs. However, I want to use GPUs to accelerate the training. To start, I ran just a basic calculation, shown in the code below:

from dask_mpi import initialize
import os
import dask.array as da
from dask.distributed import Client
import time
from contextlib import contextmanager
from distributed.scheduler import logger
import socket

@contextmanager
def timed(txt):
    t0 = time.time()
    yield
    t1 = time.time()
    print("%32s time:  %8.5f" % (txt, t1 - t0))

def example_function():
    print("start example")
    x = da.random.random((100_000, 100_000, 10), chunks=(10_000, 10_000, 5))
    y = da.random.random((100_000, 100_000, 10), chunks=(10_000, 10_000, 5))
    z = (da.arcsin(x) + da.arccos(y)).sum(axis=(1, 2)).compute()
    print(z)



if __name__ == "__main__":
    initialize(worker_class="dask_cuda.CUDAWorker", local_directory = "/home/zvladimi/scratch/MLOIS/dask_logs/", interface="ib0")
    #initialize(local_directory = "/home/zvladimi/scratch/MLOIS/dask_logs/", interface="ib0", nthreads=int(os.environ['SLURM_CPUS_PER_TASK']))
    client = Client()

    host = client.run_on_scheduler(socket.gethostname)
    port = client.scheduler_info()['services']['dashboard']
    login_node_address = "zvladimi@login.zaratan.umd.edu" # Change this to the address/domain of your login node

    logger.info(f"ssh -N -L {port}:{host}:{port} {login_node_address}")

    with timed("test"):
        example_function()

The commented-out initialize call is what I used for running without GPUs. In addition, I have this submission script:

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --time=01:00:00

#SBATCH -p gpu
#SBATCH --mem-per-cpu=40000
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=3
#SBATCH --gpus-per-task=a100:1
#SBATCH --mail-type=ALL

mpirun python3 -u /home/zvladimi/scratch/MLOIS/dask_gpu.py

ECODE=$?
echo "Job finished with exit code $ECODE."

When I ran with just CPUs, I used the same script, but not on the GPU partition and without the --gpus-per-task option.

When I run this, I get the output below. The main error seems to be:

  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_cuda/plugins.py", line 14, in setup
    os.sched_setaffinity(0, self.cores)
OSError: [Errno 22] Invalid argument

Full output:

2024-05-06 09:31:17,645 - distributed.scheduler - INFO - State start
2024-05-06 09:31:17,673 - distributed.scheduler - INFO -   Scheduler at: tcp://192.168.131.10:36517
2024-05-06 09:31:17,673 - distributed.scheduler - INFO -   dashboard at:  http://192.168.131.10:8787/status
2024-05-06 09:31:17,674 - distributed.scheduler - INFO - Registering Worker plugin shuffle
2024-05-06 09:31:37,084 - distributed.scheduler - INFO - Receive client connection: Client-eb00bb20-0bac-11ef-b208-73d997266046
2024-05-06 09:31:37,842 - distributed.core - INFO - Starting established connection to tcp://192.168.131.10:35986
2024-05-06 09:31:37,845 - distributed.worker - INFO - Run out-of-band function 'gethostname'


start example
2024-05-06 09:31:37,846 - distributed.scheduler - INFO - ssh -N -L 8787:gpu-b10-3.zaratan.umd.edu:8787 zvladimi@login.zaratan.umd.edu
2024-05-06 09:31:53,158 - distributed.nanny - INFO -         Start Nanny at: 'tcp://192.168.131.10:46789'
2024-05-06 09:31:53,165 - distributed.nanny - INFO -         Start Nanny at: 'tcp://192.168.131.10:35119'
2024-05-06 09:31:53,167 - distributed.nanny - INFO -         Start Nanny at: 'tcp://192.168.131.10:41253'
2024-05-06 09:31:53,170 - distributed.nanny - INFO -         Start Nanny at: 'tcp://192.168.131.10:41807'
2024-05-06 09:31:53,171 - distributed.nanny - INFO -         Start Nanny at: 'tcp://192.168.131.10:40355'
2024-05-06 09:31:53,174 - distributed.nanny - INFO -         Start Nanny at: 'tcp://192.168.131.10:40837'
2024-05-06 09:31:53,175 - distributed.nanny - INFO -         Start Nanny at: 'tcp://192.168.131.10:38013'
2024-05-06 09:31:53,180 - distributed.nanny - INFO -         Start Nanny at: 'tcp://192.168.131.10:35101'
2024-05-06 09:33:15,393 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2024-05-06 09:33:15,393 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2024-05-06 09:33:15,425 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2024-05-06 09:33:15,425 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2024-05-06 09:33:15,508 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2024-05-06 09:33:15,508 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2024-05-06 09:33:15,522 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2024-05-06 09:33:15,522 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2024-05-06 09:33:16,509 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2024-05-06 09:33:16,509 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2024-05-06 09:33:16,616 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2024-05-06 09:33:16,616 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2024-05-06 09:33:17,076 - distributed.preloading - INFO - Run preload setup: dask_cuda.initialize
2024-05-06 09:33:17,096 - distributed.worker - INFO -       Start worker at: tcp://192.168.131.10:43969
2024-05-06 09:33:17,097 - distributed.worker - INFO -          Listening to: tcp://192.168.131.10:43969
2024-05-06 09:33:17,097 - distributed.worker - INFO -           Worker name:                        2-2
2024-05-06 09:33:17,097 - distributed.worker - INFO -          dashboard at:       192.168.131.10:36969
2024-05-06 09:33:17,097 - distributed.worker - INFO - Waiting to connect to: tcp://192.168.131.10:36517
2024-05-06 09:33:17,097 - distributed.worker - INFO - -------------------------------------------------
2024-05-06 09:33:17,097 - distributed.worker - INFO -               Threads:                          1
2024-05-06 09:33:17,097 - distributed.worker - INFO -                Memory:                 117.19 GiB
2024-05-06 09:33:17,097 - distributed.worker - INFO -       Local Directory: /home/zvladimi/scratch/MLOIS/dask_logs/dask-scratch-space/worker-214xkhok
2024-05-06 09:33:17,097 - distributed.worker - INFO - Starting Worker plugin PreImport-e317a3f5-1cb0-4479-92ad-b8afaee8b1d4
2024-05-06 09:33:17,097 - distributed.worker - INFO - Starting Worker plugin RMMSetup-97ba5735-fc2a-48ad-93f7-fc5b13714381
2024-05-06 09:33:17,097 - distributed.worker - INFO - Starting Worker plugin CPUAffinity-3918c543-5a71-462b-b598-6edc06d63aad
2024-05-06 09:33:17,158 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2024-05-06 09:33:17,158 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2024-05-06 09:33:17,233 - distributed.worker - ERROR - [Errno 22] Invalid argument
Traceback (most recent call last):
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 832, in wrapper
    return await func(*args, **kwargs)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1873, in plugin_add
    result = plugin.setup(worker=self)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_cuda/plugins.py", line 14, in setup
    os.sched_setaffinity(0, self.cores)
OSError: [Errno 22] Invalid argument
2024-05-06 09:33:17,235 - distributed.worker - INFO - Stopping worker at tcp://192.168.131.10:43969. Reason: failure-to-start-<class 'OSError'>
2024-05-06 09:33:17,235 - distributed.worker - INFO - Closed worker has not yet started: Status.init
2024-05-06 09:33:17,246 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 664, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 1940, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/._compiler/pjwg6z5apkhw465uwep5sfp5va4f27ic/linux-rhel8-zen2/gcc/11.3.0/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1473, in start_unsafe
    raise plugins_exceptions[0]
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 832, in wrapper
    return await func(*args, **kwargs)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1873, in plugin_add
    result = plugin.setup(worker=self)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_cuda/plugins.py", line 14, in setup
    os.sched_setaffinity(0, self.cores)
OSError: [Errno 22] Invalid argument

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 967, in run
    async with worker:
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 678, in __aenter__
    await self
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 672, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2024-05-06 09:33:17,295 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 664, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 1940, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/._compiler/pjwg6z5apkhw465uwep5sfp5va4f27ic/linux-rhel8-zen2/gcc/11.3.0/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1473, in start_unsafe
    raise plugins_exceptions[0]
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 832, in wrapper
    return await func(*args, **kwargs)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1873, in plugin_add
    result = plugin.setup(worker=self)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_cuda/plugins.py", line 14, in setup
    os.sched_setaffinity(0, self.cores)
OSError: [Errno 22] Invalid argument

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 455, in instantiate
    result = await self.process.start()
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 762, in start
    msg = await self._wait_until_connected(uid)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 903, in _wait_until_connected
    raise msg["exception"]
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 967, in run
    async with worker:
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 678, in __aenter__
    await self
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 672, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2024-05-06 09:33:17,314 - distributed.nanny - INFO - Closing Nanny at 'tcp://192.168.131.10:40837'. Reason: nanny-instantiate-failed
2024-05-06 09:33:17,314 - distributed.nanny - INFO - Nanny asking worker to close. Reason: nanny-instantiate-failed
2024-05-06 09:33:17,334 - distributed.nanny - INFO - Worker process 1635143 was killed by signal 15
Traceback (most recent call last):
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 664, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 1940, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/._compiler/pjwg6z5apkhw465uwep5sfp5va4f27ic/linux-rhel8-zen2/gcc/11.3.0/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1473, in start_unsafe
    raise plugins_exceptions[0]
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 832, in wrapper
    return await func(*args, **kwargs)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1873, in plugin_add
    result = plugin.setup(worker=self)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_cuda/plugins.py", line 14, in setup
    os.sched_setaffinity(0, self.cores)
OSError: [Errno 22] Invalid argument

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 664, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 1940, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/._compiler/pjwg6z5apkhw465uwep5sfp5va4f27ic/linux-rhel8-zen2/gcc/11.3.0/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 369, in start_unsafe
    response = await self.instantiate()
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 455, in instantiate
    result = await self.process.start()
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 762, in start
    msg = await self._wait_until_connected(uid)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 903, in _wait_until_connected
    raise msg["exception"]
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 967, in run
    async with worker:
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 678, in __aenter__
    await self
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 672, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/zvladimi/scratch/MLOIS/dask_gpu.py", line 27, in <module>
    initialize(worker_class="dask_cuda.CUDAWorker", local_directory = "/home/zvladimi/scratch/MLOIS/dask_logs/", interface="ib0")
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_mpi/core.py", line 134, in initialize
    asyncio.get_event_loop().run_until_complete(run_worker())
  File "/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/._compiler/pjwg6z5apkhw465uwep5sfp5va4f27ic/linux-rhel8-zen2/gcc/11.3.0/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_mpi/core.py", line 131, in run_worker
    async with WorkerType(**opts) as worker:
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 678, in __aenter__
    await self
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_cuda/cuda_worker.py", line 242, in _wait
2024-05-06 09:33:17,335 - distributed.core - INFO - Lost connection to 'tcp://192.168.131.10:33644'
Traceback (most recent call last):
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/comm/tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 970, in _handle_comm
    result = await result
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/scheduler.py", line 4440, in add_nanny
    await comm.read()
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/comm/tcp.py", line 236, in read
    convert_stream_closed_error(self, e)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed)  local=tcp://192.168.131.10:36517 remote=tcp://192.168.131.10:33644>: Stream is closed
2024-05-06 09:33:17,408 - distributed.preloading - INFO - Run preload setup: dask_cuda.initialize
2024-05-06 09:33:17,408 - distributed.worker - INFO -       Start worker at: tcp://192.168.131.10:39813
2024-05-06 09:33:17,409 - distributed.worker - INFO -          Listening to: tcp://192.168.131.10:39813
2024-05-06 09:33:17,409 - distributed.worker - INFO -           Worker name:                        3-3
2024-05-06 09:33:17,409 - distributed.worker - INFO -          dashboard at:       192.168.131.10:36053
2024-05-06 09:33:17,409 - distributed.worker - INFO - Waiting to connect to: tcp://192.168.131.10:36517
2024-05-06 09:33:17,409 - distributed.worker - INFO - -------------------------------------------------
2024-05-06 09:33:17,409 - distributed.worker - INFO -               Threads:                          1
2024-05-06 09:33:17,409 - distributed.worker - INFO -                Memory:                 117.19 GiB
2024-05-06 09:33:17,409 - distributed.worker - INFO -       Local Directory: /home/zvladimi/scratch/MLOIS/dask_logs/dask-scratch-space/worker-3tuo2o4v
2024-05-06 09:33:17,409 - distributed.worker - INFO - Starting Worker plugin RMMSetup-68c58b95-6608-49fd-b3b3-dc0883b670b0
2024-05-06 09:33:17,409 - distributed.worker - INFO - Starting Worker plugin CPUAffinity-651b4273-93be-4e3c-9da4-aec971805338
2024-05-06 09:33:17,414 - distributed.worker - ERROR - [Errno 22] Invalid argument
Traceback (most recent call last):
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 832, in wrapper
    return await func(*args, **kwargs)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1873, in plugin_add
    result = plugin.setup(worker=self)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_cuda/plugins.py", line 14, in setup
    os.sched_setaffinity(0, self.cores)
OSError: [Errno 22] Invalid argument
2024-05-06 09:33:17,415 - distributed.worker - INFO - Starting Worker plugin PreImport-2f0a7508-b20b-4992-a660-72692bf11194
2024-05-06 09:33:17,415 - distributed.worker - INFO - Stopping worker at tcp://192.168.131.10:39813. Reason: failure-to-start-<class 'OSError'>
2024-05-06 09:33:17,415 - distributed.worker - INFO - Closed worker has not yet started: Status.init
    await asyncio.gather(*self.nannies)
  File "/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/._compiler/pjwg6z5apkhw465uwep5sfp5va4f27ic/linux-rhel8-zen2/gcc/11.3.0/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 672, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Nanny failed to start.
2024-05-06 09:33:17,425 - distributed.process - INFO - reaping stray process <SpawnProcess name='Dask Worker process (from Nanny)' pid=1635147 parent=1634825 started daemon>
2024-05-06 09:33:17,425 - distributed.process - INFO - reaping stray process <SpawnProcess name='Dask Worker process (from Nanny)' pid=1635136 parent=1634825 started daemon>
2024-05-06 09:33:17,425 - distributed.process - INFO - reaping stray process <SpawnProcess name='Dask Worker process (from Nanny)' pid=1635129 parent=1634825 started daemon>
2024-05-06 09:33:17,426 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 664, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 1940, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/._compiler/pjwg6z5apkhw465uwep5sfp5va4f27ic/linux-rhel8-zen2/gcc/11.3.0/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1473, in start_unsafe
    raise plugins_exceptions[0]
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 832, in wrapper
    return await func(*args, **kwargs)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1873, in plugin_add
    result = plugin.setup(worker=self)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_cuda/plugins.py", line 14, in setup
    os.sched_setaffinity(0, self.cores)
OSError: [Errno 22] Invalid argument

There is some more to the output but it is mostly more of the same.

My environment is:

Python 3.10.10
Cuda 12.3.52
dask 2024.1.1
dask-cuda 24.4.0
dask-cudf-cu12 24.4.1
dask-expr 0.4.0
dask-mpi 2022.4.0
distributed 2024.1.1

I am new to Dask, so I could be missing something pretty obvious. I tried dask_jobqueue’s SLURMCluster in the past and ran into issues, but if that would be a better approach than dask-mpi I can try it again. Any help would be appreciated, thank you!

This could be a configuration issue on the HPC system you are using. Searching for the error suggests that os.sched_setaffinity fails in some CPU configurations.

https://www.google.com/search?q=os.sched_setaffinity+invalid+argument&client=firefox-b-d&sca_esv=6bc39e904ccee14e&sca_upv=1&ei=COY5ZtPIAq_KhbIPna-fuAM&oq=os.sched_setaffinity+inva&gs_lp=Egxnd3Mtd2l6LXNlcnAiGW9zLnNjaGVkX3NldGFmZmluaXR5IGludmEqAggBMgUQIRigATIFECEYoAEyBRAhGKABMgUQIRigAUjNMFCpE1jfJnAFeACQAQCYAWygAcEEqgEDNi4xuAEDyAEA-AEBmAILoAKyBMICDhAAGIAEGLADGIYDGIoFwgILEAAYgAQYsAMYogTCAgUQABiABMICCxAAGIAEGIYDGIoFwgIIEAAYgAQYogSYAwCIBgGQBgSSBwQxMC4xoAfNHA&sclient=gws-wiz-serp

One thing to try is to set --cpus-per-task=1 as dask-cuda uses one thread per GPU anyway.
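
To narrow this down, it may help to compare the CPUs the job is actually allowed to use with the cores the worker is being pinned to. As far as I know, sched_setaffinity returns EINVAL when the requested mask contains none of the CPUs the process is permitted to run on (for example because of a cpuset restriction applied by SLURM). Here is a minimal sketch using only the standard library. The function name check_affinity_request and the example core list are made up for illustration and are not part of dask-cuda:

```python
import os


def check_affinity_request(requested_cores):
    """Compare a requested CPU set with the CPUs this process may use.

    Returns (allowed, usable): the CPUs the current process is allowed to
    run on, and the subset of `requested_cores` that falls inside it.
    An empty `usable` set is the situation in which os.sched_setaffinity
    is expected to fail with EINVAL.
    """
    allowed = os.sched_getaffinity(0)  # Linux only; 0 means "this process"
    usable = set(requested_cores) & allowed
    return allowed, usable


if __name__ == "__main__":
    # Hypothetical core list; substitute the cores reported for your GPU
    # (for example from the CPU Affinity column of `nvidia-smi topo -m`).
    requested = [0, 1, 2]
    allowed, usable = check_affinity_request(requested)
    print("SLURM_CPUS_PER_TASK:", os.environ.get("SLURM_CPUS_PER_TASK"))
    print("allowed CPUs:", sorted(allowed))
    print("requested CPUs:", sorted(requested))
    print("usable CPUs:", sorted(usable))
```

Running this inside the same allocation as the failing job should show whether the cores associated with the GPU overlap with the cores SLURM handed to the task. If they do not overlap, that would point at the job's CPU binding rather than at the Python packages.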

Thank you for the response. I tried setting --cpus-per-task=1 and got the same result. Looking at the first Stack Overflow question from your link, is "# echo 1 > /sys/devices/system/cpu/cpu1/online" something that I can do? If not, do you have any suggestions on how to debug this issue? This is out of my depth.

It’s definitely a tricky one. Without access to the system you are using, or a way to reproduce it on one of our systems, it’s not easy for us to look into.

Perhaps there is someone around you on an HPC support team that may be able to assist you with your particular machine’s configuration?

I see, thank you for your help. There is a support team, although when I last contacted them they seemed to think it was more an issue with the packages I had installed and the interactions between them. I’ll follow up with the information you provided here and see if we can resolve this.

Hi, did you try using LocalCUDACluster on only one GPU first? I don’t believe this would solve your issue, but it would take the MPI layer out of the picture.

Also, you might want to ensure that the CUDA version installed on your nodes is compatible with the CUDA version you are using in your Python environment.
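
One rough way to check this, assuming the toolkit version reported by nvcc is the relevant one: parse the release line of the nvcc --version output and compare its major version with the CUDA major version your Python packages target (the -cu12 suffix in dask-cudf-cu12 suggests 12). The helper name parse_nvcc_release is made up for this sketch, and it does not check the driver version, which nvidia-smi reports separately.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract (major, minor) from the 'release X.Y' line of nvcc output."""
    match = re.search(r"release\s+(\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    expected_major = 12  # assumption: taken from the -cu12 package suffix
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        print("nvcc not found on PATH; is the cuda module loaded?")
    else:
        out = subprocess.run(
            [nvcc, "--version"], capture_output=True, text=True
        ).stdout
        release = parse_nvcc_release(out)
        print("nvcc release:", release)
        if release is not None and release[0] != expected_major:
            print("CUDA major version differs from the one the packages target")
```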

Hello, sorry for the late response, I got caught up in finals. I switched over to using LocalCUDACluster by commenting out the initialize calls and adding

cluster = LocalCUDACluster()
client = Client(cluster)

The output is:

Slurm job 6339686 running on
gpu-b10-3.zaratan.umd.edu
To run 1 tasks across 1 nodes
All nodes: gpu-b10-3
Mon May 13 03:37:18 EDT 2024
/home/zvladimi/scratch/MLOIS
Loaded modules are:
Currently Loaded Modulefiles:
 1) umd-software-library/new
 2) gcc/11.3.0
 3) python/3.10.10/gcc/11.3.0/linux-rhel8-zen2(default)
 4) openmpi/gcc/11.3.0/zen2/4.1.5
 5) hdf5/gcc/11.3.0/openmpi/4.1.5/zen2/1.14.0
 6) gsl/gcc/11.3.0/zen2/2.7.1
 7) fftw/gcc/11.3.0/openmpi/4.1.5/zen2/3.3.10
 8) cuda/gcc/11.3.0/zen2/12.3.0(default)
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0
2024-05-13 03:38:29,623 - distributed.worker - ERROR - [Errno 22] Invalid argument
Traceback (most recent call last):
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 832, in wrapper
    return await func(*args, **kwargs)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1873, in plugin_add
    result = plugin.setup(worker=self)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_cuda/plugins.py", line 14, in setup
    os.sched_setaffinity(0, self.cores)
OSError: [Errno 22] Invalid argument
2024-05-13 03:38:29,625 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 664, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 1940, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/._compiler/pjwg6z5apkhw465uwep5sfp5va4f27ic/linux-rhel8-zen2/gcc/11.3.0/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1473, in start_unsafe
    raise plugins_exceptions[0]
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 832, in wrapper
    return await func(*args, **kwargs)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1873, in plugin_add
    result = plugin.setup(worker=self)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_cuda/plugins.py", line 14, in setup
    os.sched_setaffinity(0, self.cores)
OSError: [Errno 22] Invalid argument

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 967, in run
    async with worker:
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 678, in __aenter__
    await self
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 672, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2024-05-13 03:38:29,734 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 664, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 1940, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/._compiler/pjwg6z5apkhw465uwep5sfp5va4f27ic/linux-rhel8-zen2/gcc/11.3.0/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1473, in start_unsafe
    raise plugins_exceptions[0]
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 832, in wrapper
    return await func(*args, **kwargs)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1873, in plugin_add
    result = plugin.setup(worker=self)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_cuda/plugins.py", line 14, in setup
    os.sched_setaffinity(0, self.cores)
OSError: [Errno 22] Invalid argument

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 455, in instantiate
    result = await self.process.start()
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 762, in start
    msg = await self._wait_until_connected(uid)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 903, in _wait_until_connected
    raise msg["exception"]
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 967, in run
    async with worker:
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 678, in __aenter__
    await self
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 672, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2024-05-13 03:38:29,743 - tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x150b60d7c8e0>>, <Task finished name='Task-16' coro=<SpecCluster._correct_state_internal() done, defined at /scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/deploy/spec.py:346> exception=RuntimeError('Nanny failed to start.')>)
Traceback (most recent call last):
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 664, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 1940, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/._compiler/pjwg6z5apkhw465uwep5sfp5va4f27ic/linux-rhel8-zen2/gcc/11.3.0/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1473, in start_unsafe
    raise plugins_exceptions[0]
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 832, in wrapper
    return await func(*args, **kwargs)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1873, in plugin_add
    result = plugin.setup(worker=self)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_cuda/plugins.py", line 14, in setup
    os.sched_setaffinity(0, self.cores)
OSError: [Errno 22] Invalid argument

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 664, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 1940, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/._compiler/pjwg6z5apkhw465uwep5sfp5va4f27ic/linux-rhel8-zen2/gcc/11.3.0/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 369, in start_unsafe
    response = await self.instantiate()
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 455, in instantiate
    result = await self.process.start()
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 762, in start
    msg = await self._wait_until_connected(uid)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 903, in _wait_until_connected
    raise msg["exception"]
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 967, in run
    async with worker:
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 678, in __aenter__
    await self
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 672, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/tornado/ioloop.py", line 750, in _run_callback
    ret = callback()
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/tornado/ioloop.py", line 774, in _discard_future_result
    future.result()
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/deploy/spec.py", line 390, in _correct_state_internal
    await asyncio.gather(*worker_futs)
  File "/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/._compiler/pjwg6z5apkhw465uwep5sfp5va4f27ic/linux-rhel8-zen2/gcc/11.3.0/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 672, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Nanny failed to start.
Task exception was never retrieved
future: <Task finished name='Task-292' coro=<_wrap_awaitable() done, defined at /scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/deploy/spec.py:124> exception=RuntimeError('Worker failed to start.')>
Traceback (most recent call last):
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 664, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 1940, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/._compiler/pjwg6z5apkhw465uwep5sfp5va4f27ic/linux-rhel8-zen2/gcc/11.3.0/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1473, in start_unsafe
    raise plugins_exceptions[0]
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 832, in wrapper
    return await func(*args, **kwargs)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1873, in plugin_add
    result = plugin.setup(worker=self)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_cuda/plugins.py", line 14, in setup
    os.sched_setaffinity(0, self.cores)
OSError: [Errno 22] Invalid argument

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/deploy/spec.py", line 125, in _wrap_awaitable
    return await aw
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 653, in start
    raise self.__startup_exc
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 664, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 1940, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/._compiler/pjwg6z5apkhw465uwep5sfp5va4f27ic/linux-rhel8-zen2/gcc/11.3.0/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
    return await fut
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 369, in start_unsafe
    response = await self.instantiate()
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 455, in instantiate
    result = await self.process.start()
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 762, in start
    msg = await self._wait_until_connected(uid)
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 903, in _wait_until_connected
    raise msg["exception"]
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 967, in run
    async with worker:
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 678, in __aenter__
    await self
  File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 672, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.

start example
0
1
slurmstepd: error: *** JOB 6339686 ON gpu-b10-3 CANCELLED AT 2024-05-13T05:07:34 DUE TO TIME LIMIT ***

So the same error seems to occur. The script does get as far as creating the arrays (hence the printed 0 and 1), but the job then hangs until it hits the time limit, even after I reduced the task to just:

def example_function():
    print("start example")
    x = da.random.random((100, 100, 10), chunks=(10, 10, 5))
    print(0)
    y = da.random.random((100, 100, 10), chunks=(10, 10, 5))
    print(1)
    z = (da.arcsin(x) + da.arccos(y)).compute()
    print(z)
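
If the hang comes from the scheduler waiting for workers that never started (an assumption based on the "Worker failed to start" errors above, not something confirmed here), a fail-fast check along these lines might surface the problem before the time limit. This is only a sketch: require_workers is a hypothetical helper name, and it relies on the client's wait_for_workers method with a timeout.

```python
import asyncio


def require_workers(client, n_workers=1, timeout=60):
    """Return True once n_workers have connected, False if the wait times out.

    `client` is expected to be a dask.distributed.Client (or any object that
    exposes the same wait_for_workers method). The helper name is hypothetical.
    """
    try:
        client.wait_for_workers(n_workers=n_workers, timeout=timeout)
    except (TimeoutError, asyncio.TimeoutError):
        # Without this check, compute() would keep waiting for a worker.
        print(f"fewer than {n_workers} worker(s) connected after {timeout}s")
        return False
    return True
```

Calling require_workers(client) right after client = Client() and exiting when it returns False would end the job early instead of letting it run into the SLURM time limit.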

I believe the CUDA version is compatible, since running nvcc --version in the SLURM script prints:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0

I believe this is compatible with the CUDA bindings installed in the Python environment: cuda-python 12.4.0.

Would you mind opening an issue on the dask-cuda GitHub repo with that traceback so one of the maintainers can investigate? Please include all the steps required to reproduce it (extra points if you can reproduce it on your laptop rather than on the HPC cluster).
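
For the reproducer, it may help to record which CPUs the job is actually allowed to use. On Linux, Errno 22 from os.sched_setaffinity usually means that none of the requested CPUs are available to the process, which can happen when SLURM confines a task to its allocated cores. Whether that is the cause here is an assumption. A minimal, Linux-only sketch (affinity_overlap and can_pin_to are hypothetical helper names, not part of dask-cuda):

```python
import os


def affinity_overlap(requested_cores):
    """Split requested cores into those this process may and may not use."""
    allowed = os.sched_getaffinity(0)  # Linux-only; 0 means "this process"
    requested = set(requested_cores)
    return sorted(requested & allowed), sorted(requested - allowed)


def can_pin_to(cores):
    """Try to pin this process to `cores`, then restore the original mask."""
    cores = set(cores)
    original = os.sched_getaffinity(0)
    try:
        os.sched_setaffinity(0, cores)
    except OSError as exc:
        print(f"sched_setaffinity({sorted(cores)}) failed: {exc}")
        return False
    finally:
        # Always put the original affinity mask back.
        os.sched_setaffinity(0, original)
    return True


if __name__ == "__main__":
    print("allowed CPUs:", sorted(os.sched_getaffinity(0)))
```

Running this under the same mpirun and #SBATCH settings as the failing job, and including the printed CPU list in the issue, would show whether each task only sees the 3 cores requested with --cpus-per-task.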