Dask cluster with a large number of workers gives "asyncio.exceptions.TimeoutError: Nanny failed to start"

Hi

I’m using Dask on multiple infrastructures. The scheduler is initialized first, then the dask-workers join via different submission methods (with or without dask-jobqueue).
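For reference, here is a minimal sketch of the dask-jobqueue variant of this setup (the resource values are placeholders, not my actual configuration); in the other variant, the workers are plain dask-worker processes submitted by hand that connect to the scheduler address:

# Sketch of the dask-jobqueue variant: the cluster object starts the
# scheduler first, then submits worker jobs that join it.
from dask_jobqueue import PBSCluster

cluster = PBSCluster(cores=4, memory="8GB")  # placeholder resources
cluster.scale(jobs=20)  # number of worker jobs to submit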

What I’m seeing, independently of the infrastructure, is that with two dask-workers it always works like a charm.

But when I increase the number of dask-workers (to 20 or 100, for instance), some of them never manage to start.

Each time a dask-worker fails to start, I get the exact same stack trace:

 distributed.nanny - INFO -         Start Nanny at: 'tcp://...:51682'
 distributed.nanny - INFO - Closing Nanny at 'tcp://...:51682'
 distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
 distributed.dask_worker - INFO - End worker
 ...
 asyncio.exceptions.TimeoutError: Nanny failed to start in 240 seconds

How can I debug this, please?

Full trace on the large HTCondor infrastructure:

distributed.nanny - INFO -         Start Nanny at: 'tcp://10.5.230.211:22967'
distributed.nanny - INFO - Closing Nanny at 'tcp://10.5.230.211:22967'
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/site-packages/distributed/nanny.py", line 338, in start
    response = await self.instantiate()
  File "/site-packages/distributed/nanny.py", line 407, in instantiate
    result = await asyncio.wait_for(
  File "/asyncio/tasks.py", line 466, in wait_for
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/asyncio/tasks.py", line 490, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/site-packages/distributed/core.py", line 269, in _
    await asyncio.wait_for(self.start(), timeout=timeout)
  File "/asyncio/tasks.py", line 492, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/site-packages/distributed/cli/dask_worker.py", line 469, in <module>
    go()
  File "/site-packages/distributed/cli/dask_worker.py", line 465, in go
    main()
  File "/site-packages/click/core.py", line 1126, in __call__
    return self.main(*args, **kwargs)
  File "/site-packages/click/core.py", line 1051, in main
    rv = self.invoke(ctx)
  File "/site-packages/click/core.py", line 1393, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/site-packages/click/core.py", line 752, in invoke
    return __callback(*args, **kwargs)
  File "/site-packages/distributed/cli/dask_worker.py", line 451, in main
    loop.run_sync(run)
  File "/site-packages/tornado/ioloop.py", line 530, in run_sync
    return future_cell[0].result()
  File "/site-packages/distributed/cli/dask_worker.py", line 445, in run
    await asyncio.gather(*nannies)
  File "/asyncio/tasks.py", line 688, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/site-packages/distributed/core.py", line 273, in _
    raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 60 seconds

Full trace on the large PBS infrastructure:

distributed.nanny - INFO -         Start Nanny at: 'tcp://...:51682'
distributed.nanny - INFO - Closing Nanny at 'tcp://...:51682'
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/.../site-packages/distributed/nanny.py", line 338, in start
    response = await self.instantiate()
  File "/.../site-packages/distributed/nanny.py", line 407, in instantiate
    result = await asyncio.wait_for(
  File "/.../asyncio/tasks.py", line 468, in wait_for
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/.../asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/.../site-packages/distributed/core.py", line 269, in _
    await asyncio.wait_for(self.start(), timeout=timeout)
  File "/.../asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/.../runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/.../runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/.../site-packages/distributed/cli/dask_worker.py", line 469, in <module>
    go()
  File "/.../site-packages/distributed/cli/dask_worker.py", line 465, in go
    main()
  File "/.../site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/.../site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/.../site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/.../site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/.../site-packages/distributed/cli/dask_worker.py", line 451, in main
    loop.run_sync(run)
  File "/.../site-packages/tornado/ioloop.py", line 530, in run_sync
    return future_cell[0].result()
  File "/.../site-packages/distributed/cli/dask_worker.py", line 445, in run
    await asyncio.gather(*nannies)
  File "/.../asyncio/tasks.py", line 691, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/.../site-packages/distributed/core.py", line 273, in _
    raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 240 seconds

@Arnaud Welcome to Discourse and thanks for your question! This error can be caused by a lot of different things… Would you be able to share a minimal, reproducible example, as well as some more details about your infrastructure and how you’re creating the cluster/workers? It’ll help us better diagnose what’s going on!

I also found this issue that seems similar: Nanny Fails to Connect in 60 seconds · Issue #391 · dask/dask-jobqueue · GitHub

How can I debug this, please?

Some of these general debugging approaches might be helpful: Debug — Dask documentation
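For instance, you could raise distributed’s log level before the worker starts, so the nanny’s startup is logged in more detail. A minimal sketch for workers launched from Python (for CLI workers, the equivalent would be Dask’s logging configuration, e.g. setting logging: distributed: debug in your Dask config file):

import logging

# Turn up distributed's loggers so the nanny's startup sequence
# (and any failure along the way) is logged in full detail.
logging.getLogger("distributed").setLevel(logging.DEBUG)
logging.getLogger("distributed.nanny").setLevel(logging.DEBUG)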

Hi @Arnaud, out of curiosity, do you also see a timeout error with LocalCluster when you launch a lot of worker processes on a single node, as in [Best practice] Deploy a cluster on an interactive compute node on a slurm cluster?
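Something along these lines, as a rough smoke test (the worker count is illustrative):

from dask.distributed import Client, LocalCluster

# Many nanny-managed worker processes on a single node, to check
# whether the startup timeout reproduces without the job queue.
cluster = LocalCluster(n_workers=100, threads_per_worker=1, processes=True)
client = Client(cluster)
print(len(client.scheduler_info()["workers"]))  # expect 100 if all started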

There seem to be more timeout issues lately with workers connecting to the scheduler.
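If the connection step is what’s timing out, one knob that sometimes helps (just a suggestion; the value is illustrative) is the connect timeout in the Dask config:

import dask

# Give workers more time to establish their initial connection when
# many of them try to register with the scheduler at once. For CLI
# workers, the equivalent environment variable would be
# DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT="60s".
dask.config.set({"distributed.comm.timeouts.connect": "60s"})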
