I am trying to set up XGBoost with Dask on a SLURM cluster, and I was able to get it working using only CPUs. However, I want to use GPUs to accelerate the training. To start with, I ran just a basic calculation, shown in the code below:
from dask_mpi import initialize
import os
import dask.array as da
from dask.distributed import Client
import time
from contextlib import contextmanager
from distributed.scheduler import logger
import socket


@contextmanager
def timed(txt):
    # Print the wall-clock time spent inside the "with" block.
    t0 = time.time()
    yield
    t1 = time.time()
    print("%32s time: %8.5f" % (txt, t1 - t0))


def example_function():
    print("start example")
    x = da.random.random((100_000, 100_000, 10), chunks=(10_000, 10_000, 5))
    y = da.random.random((100_000, 100_000, 10), chunks=(10_000, 10_000, 5))
    z = (da.arcsin(x) + da.arccos(y)).sum(axis=(1, 2)).compute()
    print(z)


if __name__ == "__main__":
    initialize(worker_class="dask_cuda.CUDAWorker", local_directory = "/home/zvladimi/scratch/MLOIS/dask_logs/", interface="ib0")
    #initialize(local_directory = "/home/zvladimi/scratch/MLOIS/dask_logs/", interface="ib0", nthreads=int(os.environ['SLURM_CPUS_PER_TASK']))
    client = Client()
    host = client.run_on_scheduler(socket.gethostname)
    port = client.scheduler_info()['services']['dashboard']
    login_node_address = "zvladimi@login.zaratan.umd.edu"  # Change this to the address/domain of your login node
    # Log the SSH tunnel command needed to reach the dashboard from outside the cluster.
    logger.info(f"ssh -N -L {port}:{host}:{port} {login_node_address}")
    with timed("test"):
        example_function()
The commented-out initialize call is what I used when running without GPUs. In addition, I have this submission script:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --time=01:00:00
#SBATCH -p gpu
#SBATCH --mem-per-cpu=40000
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=3
#SBATCH --gpus-per-task=a100:1
#SBATCH --mail-type=ALL
mpirun python3 -u /home/zvladimi/scratch/MLOIS/dask_gpu.py
ECODE=$?
echo "Job finished with exit code $ECODE."
When I ran with just CPUs, I used the same script, but not on the GPU partition and without the --gpus-per-task option.
When I run this, I get the output below. The main error seems to be:
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_cuda/plugins.py", line 14, in setup
os.sched_setaffinity(0, self.cores)
OSError: [Errno 22] Invalid argument
Full output:
2024-05-06 09:31:17,645 - distributed.scheduler - INFO - State start
2024-05-06 09:31:17,673 - distributed.scheduler - INFO - Scheduler at: tcp://192.168.131.10:36517
2024-05-06 09:31:17,673 - distributed.scheduler - INFO - dashboard at: http://192.168.131.10:8787/status
2024-05-06 09:31:17,674 - distributed.scheduler - INFO - Registering Worker plugin shuffle
2024-05-06 09:31:37,084 - distributed.scheduler - INFO - Receive client connection: Client-eb00bb20-0bac-11ef-b208-73d997266046
2024-05-06 09:31:37,842 - distributed.core - INFO - Starting established connection to tcp://192.168.131.10:35986
2024-05-06 09:31:37,845 - distributed.worker - INFO - Run out-of-band function 'gethostname'
start example
2024-05-06 09:31:37,846 - distributed.scheduler - INFO - ssh -N -L 8787:gpu-b10-3.zaratan.umd.edu:8787 zvladimi@login.zaratan.umd.edu
2024-05-06 09:31:53,158 - distributed.nanny - INFO - Start Nanny at: 'tcp://192.168.131.10:46789'
2024-05-06 09:31:53,165 - distributed.nanny - INFO - Start Nanny at: 'tcp://192.168.131.10:35119'
2024-05-06 09:31:53,167 - distributed.nanny - INFO - Start Nanny at: 'tcp://192.168.131.10:41253'
2024-05-06 09:31:53,170 - distributed.nanny - INFO - Start Nanny at: 'tcp://192.168.131.10:41807'
2024-05-06 09:31:53,171 - distributed.nanny - INFO - Start Nanny at: 'tcp://192.168.131.10:40355'
2024-05-06 09:31:53,174 - distributed.nanny - INFO - Start Nanny at: 'tcp://192.168.131.10:40837'
2024-05-06 09:31:53,175 - distributed.nanny - INFO - Start Nanny at: 'tcp://192.168.131.10:38013'
2024-05-06 09:31:53,180 - distributed.nanny - INFO - Start Nanny at: 'tcp://192.168.131.10:35101'
2024-05-06 09:33:15,393 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2024-05-06 09:33:15,393 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2024-05-06 09:33:15,425 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2024-05-06 09:33:15,425 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2024-05-06 09:33:15,508 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2024-05-06 09:33:15,508 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2024-05-06 09:33:15,522 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2024-05-06 09:33:15,522 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2024-05-06 09:33:16,509 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2024-05-06 09:33:16,509 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2024-05-06 09:33:16,616 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2024-05-06 09:33:16,616 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2024-05-06 09:33:17,076 - distributed.preloading - INFO - Run preload setup: dask_cuda.initialize
2024-05-06 09:33:17,096 - distributed.worker - INFO - Start worker at: tcp://192.168.131.10:43969
2024-05-06 09:33:17,097 - distributed.worker - INFO - Listening to: tcp://192.168.131.10:43969
2024-05-06 09:33:17,097 - distributed.worker - INFO - Worker name: 2-2
2024-05-06 09:33:17,097 - distributed.worker - INFO - dashboard at: 192.168.131.10:36969
2024-05-06 09:33:17,097 - distributed.worker - INFO - Waiting to connect to: tcp://192.168.131.10:36517
2024-05-06 09:33:17,097 - distributed.worker - INFO - -------------------------------------------------
2024-05-06 09:33:17,097 - distributed.worker - INFO - Threads: 1
2024-05-06 09:33:17,097 - distributed.worker - INFO - Memory: 117.19 GiB
2024-05-06 09:33:17,097 - distributed.worker - INFO - Local Directory: /home/zvladimi/scratch/MLOIS/dask_logs/dask-scratch-space/worker-214xkhok
2024-05-06 09:33:17,097 - distributed.worker - INFO - Starting Worker plugin PreImport-e317a3f5-1cb0-4479-92ad-b8afaee8b1d4
2024-05-06 09:33:17,097 - distributed.worker - INFO - Starting Worker plugin RMMSetup-97ba5735-fc2a-48ad-93f7-fc5b13714381
2024-05-06 09:33:17,097 - distributed.worker - INFO - Starting Worker plugin CPUAffinity-3918c543-5a71-462b-b598-6edc06d63aad
2024-05-06 09:33:17,158 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2024-05-06 09:33:17,158 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2024-05-06 09:33:17,233 - distributed.worker - ERROR - [Errno 22] Invalid argument
Traceback (most recent call last):
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 832, in wrapper
return await func(*args, **kwargs)
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1873, in plugin_add
result = plugin.setup(worker=self)
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_cuda/plugins.py", line 14, in setup
os.sched_setaffinity(0, self.cores)
OSError: [Errno 22] Invalid argument
2024-05-06 09:33:17,235 - distributed.worker - INFO - Stopping worker at tcp://192.168.131.10:43969. Reason: failure-to-start-<class 'OSError'>
2024-05-06 09:33:17,235 - distributed.worker - INFO - Closed worker has not yet started: Status.init
2024-05-06 09:33:17,246 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 664, in start
await wait_for(self.start_unsafe(), timeout=timeout)
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 1940, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/._compiler/pjwg6z5apkhw465uwep5sfp5va4f27ic/linux-rhel8-zen2/gcc/11.3.0/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
return await fut
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1473, in start_unsafe
raise plugins_exceptions[0]
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 832, in wrapper
return await func(*args, **kwargs)
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1873, in plugin_add
result = plugin.setup(worker=self)
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_cuda/plugins.py", line 14, in setup
os.sched_setaffinity(0, self.cores)
OSError: [Errno 22] Invalid argument
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 967, in run
async with worker:
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 678, in __aenter__
await self
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 672, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2024-05-06 09:33:17,295 - distributed.nanny - ERROR - Failed to start process
Traceback (most recent call last):
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 664, in start
await wait_for(self.start_unsafe(), timeout=timeout)
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 1940, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/._compiler/pjwg6z5apkhw465uwep5sfp5va4f27ic/linux-rhel8-zen2/gcc/11.3.0/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
return await fut
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1473, in start_unsafe
raise plugins_exceptions[0]
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 832, in wrapper
return await func(*args, **kwargs)
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1873, in plugin_add
result = plugin.setup(worker=self)
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_cuda/plugins.py", line 14, in setup
os.sched_setaffinity(0, self.cores)
OSError: [Errno 22] Invalid argument
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 455, in instantiate
result = await self.process.start()
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 762, in start
msg = await self._wait_until_connected(uid)
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 903, in _wait_until_connected
raise msg["exception"]
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 967, in run
async with worker:
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 678, in __aenter__
await self
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 672, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
2024-05-06 09:33:17,314 - distributed.nanny - INFO - Closing Nanny at 'tcp://192.168.131.10:40837'. Reason: nanny-instantiate-failed
2024-05-06 09:33:17,314 - distributed.nanny - INFO - Nanny asking worker to close. Reason: nanny-instantiate-failed
2024-05-06 09:33:17,334 - distributed.nanny - INFO - Worker process 1635143 was killed by signal 15
Traceback (most recent call last):
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 664, in start
await wait_for(self.start_unsafe(), timeout=timeout)
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 1940, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/._compiler/pjwg6z5apkhw465uwep5sfp5va4f27ic/linux-rhel8-zen2/gcc/11.3.0/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
return await fut
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1473, in start_unsafe
raise plugins_exceptions[0]
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 832, in wrapper
return await func(*args, **kwargs)
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1873, in plugin_add
result = plugin.setup(worker=self)
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_cuda/plugins.py", line 14, in setup
os.sched_setaffinity(0, self.cores)
OSError: [Errno 22] Invalid argument
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 664, in start
await wait_for(self.start_unsafe(), timeout=timeout)
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 1940, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/._compiler/pjwg6z5apkhw465uwep5sfp5va4f27ic/linux-rhel8-zen2/gcc/11.3.0/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
return await fut
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 369, in start_unsafe
response = await self.instantiate()
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 455, in instantiate
result = await self.process.start()
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 762, in start
msg = await self._wait_until_connected(uid)
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 903, in _wait_until_connected
raise msg["exception"]
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/nanny.py", line 967, in run
async with worker:
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 678, in __aenter__
await self
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 672, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Worker failed to start.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/zvladimi/scratch/MLOIS/dask_gpu.py", line 27, in <module>
initialize(worker_class="dask_cuda.CUDAWorker", local_directory = "/home/zvladimi/scratch/MLOIS/dask_logs/", interface="ib0")
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_mpi/core.py", line 134, in initialize
asyncio.get_event_loop().run_until_complete(run_worker())
File "/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/._compiler/pjwg6z5apkhw465uwep5sfp5va4f27ic/linux-rhel8-zen2/gcc/11.3.0/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_mpi/core.py", line 131, in run_worker
async with WorkerType(**opts) as worker:
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 678, in __aenter__
await self
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_cuda/cuda_worker.py", line 242, in _wait
2024-05-06 09:33:17,335 - distributed.core - INFO - Lost connection to 'tcp://192.168.131.10:33644'
Traceback (most recent call last):
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/comm/tcp.py", line 225, in read
frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 970, in _handle_comm
result = await result
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/scheduler.py", line 4440, in add_nanny
await comm.read()
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/comm/tcp.py", line 236, in read
convert_stream_closed_error(self, e)
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://192.168.131.10:36517 remote=tcp://192.168.131.10:33644>: Stream is closed
2024-05-06 09:33:17,408 - distributed.preloading - INFO - Run preload setup: dask_cuda.initialize
2024-05-06 09:33:17,408 - distributed.worker - INFO - Start worker at: tcp://192.168.131.10:39813
2024-05-06 09:33:17,409 - distributed.worker - INFO - Listening to: tcp://192.168.131.10:39813
2024-05-06 09:33:17,409 - distributed.worker - INFO - Worker name: 3-3
2024-05-06 09:33:17,409 - distributed.worker - INFO - dashboard at: 192.168.131.10:36053
2024-05-06 09:33:17,409 - distributed.worker - INFO - Waiting to connect to: tcp://192.168.131.10:36517
2024-05-06 09:33:17,409 - distributed.worker - INFO - -------------------------------------------------
2024-05-06 09:33:17,409 - distributed.worker - INFO - Threads: 1
2024-05-06 09:33:17,409 - distributed.worker - INFO - Memory: 117.19 GiB
2024-05-06 09:33:17,409 - distributed.worker - INFO - Local Directory: /home/zvladimi/scratch/MLOIS/dask_logs/dask-scratch-space/worker-3tuo2o4v
2024-05-06 09:33:17,409 - distributed.worker - INFO - Starting Worker plugin RMMSetup-68c58b95-6608-49fd-b3b3-dc0883b670b0
2024-05-06 09:33:17,409 - distributed.worker - INFO - Starting Worker plugin CPUAffinity-651b4273-93be-4e3c-9da4-aec971805338
2024-05-06 09:33:17,414 - distributed.worker - ERROR - [Errno 22] Invalid argument
Traceback (most recent call last):
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 832, in wrapper
return await func(*args, **kwargs)
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1873, in plugin_add
result = plugin.setup(worker=self)
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_cuda/plugins.py", line 14, in setup
os.sched_setaffinity(0, self.cores)
OSError: [Errno 22] Invalid argument
2024-05-06 09:33:17,415 - distributed.worker - INFO - Starting Worker plugin PreImport-2f0a7508-b20b-4992-a660-72692bf11194
2024-05-06 09:33:17,415 - distributed.worker - INFO - Stopping worker at tcp://192.168.131.10:39813. Reason: failure-to-start-<class 'OSError'>
2024-05-06 09:33:17,415 - distributed.worker - INFO - Closed worker has not yet started: Status.init
await asyncio.gather(*self.nannies)
File "/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/._compiler/pjwg6z5apkhw465uwep5sfp5va4f27ic/linux-rhel8-zen2/gcc/11.3.0/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 672, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Nanny failed to start.
2024-05-06 09:33:17,425 - distributed.process - INFO - reaping stray process <SpawnProcess name='Dask Worker process (from Nanny)' pid=1635147 parent=1634825 started daemon>
2024-05-06 09:33:17,425 - distributed.process - INFO - reaping stray process <SpawnProcess name='Dask Worker process (from Nanny)' pid=1635136 parent=1634825 started daemon>
2024-05-06 09:33:17,425 - distributed.process - INFO - reaping stray process <SpawnProcess name='Dask Worker process (from Nanny)' pid=1635129 parent=1634825 started daemon>
2024-05-06 09:33:17,426 - distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/core.py", line 664, in start
await wait_for(self.start_unsafe(), timeout=timeout)
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 1940, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/cvmfs/hpcsw.umd.edu/spack-software/2023.11.20/views/2023/linux-rhel8-zen2/gcc@11.3.0/python-3.10.10/._compiler/pjwg6z5apkhw465uwep5sfp5va4f27ic/linux-rhel8-zen2/gcc/11.3.0/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
return await fut
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1473, in start_unsafe
raise plugins_exceptions[0]
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/utils.py", line 832, in wrapper
return await func(*args, **kwargs)
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/distributed/worker.py", line 1873, in plugin_add
result = plugin.setup(worker=self)
File "/scratch/zt1/project/diemer-prj/user/zvladimi/MLOIS/.myvenv/lib/python3.10/site-packages/dask_cuda/plugins.py", line 14, in setup
os.sched_setaffinity(0, self.cores)
OSError: [Errno 22] Invalid argument
There is some more to the output but it is mostly more of the same.
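For reference, on Linux os.sched_setaffinity fails with EINVAL (Errno 22) when none of the requested CPUs is both present on the machine and permitted for the process, for example because of a cpuset cgroup. Whether that is what happens here is only an assumption: the log does not show which cores dask-cuda requested. Below is a minimal sketch for checking it from inside the job. The helper names and the requested set {0, 1, 2} are hypothetical placeholders, the latter standing in for whatever ends up in self.cores.

```python
import os


def usable_cores(requested, allowed):
    """Return the requested CPU ids that are also in the allowed set, sorted.

    An empty result is the situation in which sched_setaffinity reports EINVAL.
    """
    return sorted(set(requested) & set(allowed))


def report_affinity(requested):
    """Print the CPUs this process may run on next to the requested ones."""
    # Linux-only call; the returned set reflects any cgroup/cpuset restriction.
    allowed = os.sched_getaffinity(0)
    print("allowed CPUs:  ", sorted(allowed))
    print("requested CPUs:", sorted(requested))
    print("usable overlap:", usable_cores(requested, allowed))


if __name__ == "__main__" and hasattr(os, "sched_getaffinity"):
    # Hypothetical core list; replace with the cores actually being requested.
    report_affinity({0, 1, 2})
```

Running this under the same mpirun line as the job would show, per task, whether the allowed set is limited to the CPUs SLURM assigned to that task.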
My environment is:
Python 3.10.10
CUDA 12.3.52
dask 2024.1.1
dask-cuda 24.4.0
dask-cudf-cu12 24.4.1
dask-expr 0.4.0
dask-mpi 2022.4.0
distributed 2024.1.1
I am new to Dask, so I could be missing something pretty obvious. I tried dask_jobqueue’s SLURMCluster in the past but ran into issues; if that would be a better approach than dask-mpi, I can try it again. Any help would be appreciated. Thank you!