Interface issue using dask-jobqueue in Slurm cluster

Hello Folks,

I am seeking your help with a problem I am encountering while using dask-jobqueue on a Slurm cluster. My code uses dask-jobqueue to connect workers to a scheduler in order to execute my tasks. The code itself runs without raising any errors, but I recently noticed that the workers are not connecting to the scheduler, and the worker log file shows an error saying that the interface I specified is not valid, even though I chose "ib0" as the interface.

I am wondering if you could help me resolve this issue by suggesting any possible solutions or providing any guidance on how to identify the correct interface. Your assistance would be greatly appreciated.

from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=3,
    processes=1,
    memory="15GB",
    shebang="#!/usr/bin/env bash",
    queue="****",
    walltime="01:00:00",
    death_timeout="30s",
    interface="ib0",
)

2023-03-31 13:04:34,234 - distributed.nanny - INFO - Closing Nanny at 'not-running'. Reason: nanny-close
2023-03-31 13:04:34,236 - distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/scicore/home/roeoesli/valipo0000/training/anaconda3/envs/py38/lib/python3.8/site-packages/distributed/core.py", line 528, in start
    await asyncio.wait_for(self.start_unsafe(), timeout=timeout)
  File "/scicore/home/roeoesli/valipo0000/training/anaconda3/envs/py38/lib/python3.8/asyncio/tasks.py", line 494, in wait_for
    return fut.result()
  File "/scicore/home/roeoesli/valipo0000/training/anaconda3/envs/py38/lib/python3.8/site-packages/distributed/nanny.py", line 331, in start_unsafe
    start_address = address_from_user_args(
  File "/scicore/home/roeoesli/valipo0000/training/anaconda3/envs/py38/lib/python3.8/site-packages/distributed/comm/addressing.py", line 290, in address_from_user_args
    host = get_ip_interface(interface)
  File "/scicore/home/roeoesli/valipo0000/training/anaconda3/envs/py38/lib/python3.8/site-packages/distributed/utils.py", line 208, in get_ip_interface
    raise ValueError(
ValueError: 'ib0' is not a valid network interface. Valid network interfaces are: ['lo', 'eth2', 'eth0', 'eth1']

Hi @daskforscience, welcome to this community!

This message is saying that there is no ib0 network interface on the server running your Worker, so you should try one of the eth* interfaces listed in the error instead. Also, make sure the interface you choose also exists on the machine running the Scheduler.
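To check which names are actually valid, one option is to list the interfaces directly on a compute node (for example from inside a Slurm job), since the login node and the compute nodes may differ. The following is only a minimal sketch using the standard library; the helper name `list_interfaces` is made up for illustration, and `socket.if_nameindex()` is only available on Unix-like systems:

```python
import socket


def list_interfaces():
    """Return the names of the network interfaces visible on this host."""
    # socket.if_nameindex() yields (index, name) pairs for each interface.
    return [name for _index, name in socket.if_nameindex()]


if __name__ == "__main__":
    # Print the hostname too, so output from different nodes can be compared.
    print(socket.gethostname(), list_interfaces())
```

Running the same script on the node that hosts the Scheduler and on a Worker node should show whether a name such as ib0 exists on both.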

Hi @guillaumeeb

Thank you for your response. ib0 does exist, and I also tried the other interfaces, but it still did not work. However, I think I found the solution: I replaced interface='ib0' with scheduler_options={"interface": "ib0"} and it worked.

Kind regards,
Behzad

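This means that the ib0 interface is available on the machine where you run your Scheduler, but not on the machines where the Workers are running. The interface argument applies to both the Scheduler and the Workers, whereas scheduler_options only affects the Scheduler. So you may have to specify a different interface for the Scheduler than for the Workers, which seems to be the case for you.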
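If you want to set both sides explicitly, a rough sketch could look like the one below. It reuses the settings from the original post; the helper names `cluster_kwargs` and `make_cluster` are hypothetical, and "eth0" as the Worker interface is only an assumption taken from the list in the error message, so you would need to check which eth* interface can actually reach the Scheduler on your cluster:

```python
def cluster_kwargs(worker_interface, scheduler_interface, queue):
    """Build SLURMCluster keyword arguments with separate interfaces."""
    return {
        "cores": 3,
        "processes": 1,
        "memory": "15GB",
        "shebang": "#!/usr/bin/env bash",
        "queue": queue,
        "walltime": "01:00:00",
        "death_timeout": "30s",
        # Interface used by the Workers on the compute nodes.
        "interface": worker_interface,
        # Interface used only by the Scheduler.
        "scheduler_options": {"interface": scheduler_interface},
    }


def make_cluster(queue, worker_interface="eth0", scheduler_interface="ib0"):
    """Create the cluster; "eth0" is an assumed Worker interface."""
    # Imported here so the helper above can be inspected without dask-jobqueue.
    from dask_jobqueue import SLURMCluster

    return SLURMCluster(
        **cluster_kwargs(worker_interface, scheduler_interface, queue)
    )
```

Calling `make_cluster(queue=...)` with your own queue name and then `cluster.job_script()` should let you inspect the generated job script before any Worker is submitted.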
