Dask cluster with a large number of workers gives "asyncio.exceptions.TimeoutError: Nanny failed to start"

Hi

I’m using Dask on multiple infrastructures. The scheduler is initialized first, then the dask-workers join via different submission methods (with or without dask-jobqueue).
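For reference, here is a minimal sketch of the dask-jobqueue variant of this setup (the resource values are placeholders, not my actual configuration); in the other variant, the workers are plain dask-worker processes submitted by hand that connect to the scheduler address:

# Sketch of the dask-jobqueue variant: the cluster object starts the
# scheduler first, then submits worker jobs that join it.
from dask_jobqueue import PBSCluster

cluster = PBSCluster(cores=4, memory="8GB")  # placeholder resources
cluster.scale(jobs=20)  # number of worker jobs to submit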

What I’m seeing, independently of the infrastructure, is that with two dask-workers it always works like a charm.

But when I increase the number of dask-workers (to 20 or 100, for instance), some of them never manage to start.

Each time a dask-worker fails to start, I get the exact same stack trace:

 distributed.nanny - INFO -         Start Nanny at: 'tcp://...:51682'
 distributed.nanny - INFO - Closing Nanny at 'tcp://...:51682'
 distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
 distributed.dask_worker - INFO - End worker
 ...
 asyncio.exceptions.TimeoutError: Nanny failed to start in 240 seconds

How can I debug this, please?

Full trace on the large HTCondor infrastructure:

distributed.nanny - INFO -         Start Nanny at: 'tcp://10.5.230.211:22967'
distributed.nanny - INFO - Closing Nanny at 'tcp://10.5.230.211:22967'
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/site-packages/distributed/nanny.py", line 338, in start
    response = await self.instantiate()
  File "/site-packages/distributed/nanny.py", line 407, in instantiate
    result = await asyncio.wait_for(
  File "/asyncio/tasks.py", line 466, in wait_for
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/asyncio/tasks.py", line 490, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/site-packages/distributed/core.py", line 269, in _
    await asyncio.wait_for(self.start(), timeout=timeout)
  File "/asyncio/tasks.py", line 492, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/site-packages/distributed/cli/dask_worker.py", line 469, in <module>
    go()
  File "/site-packages/distributed/cli/dask_worker.py", line 465, in go
    main()
  File "/site-packages/click/core.py", line 1126, in __call__
    return self.main(*args, **kwargs)
  File "/site-packages/click/core.py", line 1051, in main
    rv = self.invoke(ctx)
  File "/site-packages/click/core.py", line 1393, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/site-packages/click/core.py", line 752, in invoke
    return __callback(*args, **kwargs)
  File "/site-packages/distributed/cli/dask_worker.py", line 451, in main
    loop.run_sync(run)
  File "/site-packages/tornado/ioloop.py", line 530, in run_sync
    return future_cell[0].result()
  File "/site-packages/distributed/cli/dask_worker.py", line 445, in run
    await asyncio.gather(*nannies)
  File "/asyncio/tasks.py", line 688, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/site-packages/distributed/core.py", line 273, in _
    raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 60 seconds

Full trace on the large PBS infrastructure:

distributed.nanny - INFO -         Start Nanny at: 'tcp://...:51682'
distributed.nanny - INFO - Closing Nanny at 'tcp://...:51682'
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
  File "/.../site-packages/distributed/nanny.py", line 338, in start
    response = await self.instantiate()
  File "/.../site-packages/distributed/nanny.py", line 407, in instantiate
    result = await asyncio.wait_for(
  File "/.../asyncio/tasks.py", line 468, in wait_for
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/.../asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/.../site-packages/distributed/core.py", line 269, in _
    await asyncio.wait_for(self.start(), timeout=timeout)
  File "/.../asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/.../runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/.../runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/.../site-packages/distributed/cli/dask_worker.py", line 469, in <module>
    go()
  File "/.../site-packages/distributed/cli/dask_worker.py", line 465, in go
    main()
  File "/.../site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/.../site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/.../site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/.../site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/.../site-packages/distributed/cli/dask_worker.py", line 451, in main
    loop.run_sync(run)
  File "/.../site-packages/tornado/ioloop.py", line 530, in run_sync
    return future_cell[0].result()
  File "/.../site-packages/distributed/cli/dask_worker.py", line 445, in run
    await asyncio.gather(*nannies)
  File "/.../asyncio/tasks.py", line 691, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/.../site-packages/distributed/core.py", line 273, in _
    raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 240 seconds

@Arnaud Welcome to Discourse and thanks for your question! This error can be caused by a lot of different things… Would you be able to share a minimal, reproducible example, as well as some more details about your infrastructure and how you’re creating the cluster/workers? It’ll help us better diagnose what’s going on!

I also found this issue that seems similar: Nanny Fails to Connect in 60 seconds · Issue #391 · dask/dask-jobqueue · GitHub

How can I debug this, please?

Some of these general debugging approaches might be helpful: Debug — Dask documentation
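For instance, you could raise distributed’s log level before the worker starts, so the nanny’s startup is logged in more detail. A minimal sketch for workers launched from Python (for CLI workers, the equivalent would be Dask’s logging configuration, e.g. setting logging: distributed: debug in your Dask config file):

import logging

# Turn up distributed's loggers so the nanny's startup sequence
# (and any failure along the way) is logged in full detail.
logging.getLogger("distributed").setLevel(logging.DEBUG)
logging.getLogger("distributed.nanny").setLevel(logging.DEBUG)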

Hi @Arnaud, out of curiosity, do you also see a timeout error with LocalCluster when you launch a lot of worker processes on a single node, as in [Best practice] Deploy a cluster on an interactive compute node on a slurm cluster?
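Something along these lines, as a rough smoke test (the worker count is illustrative):

from dask.distributed import Client, LocalCluster

# Many nanny-managed worker processes on a single node, to check
# whether the startup timeout reproduces without the job queue.
cluster = LocalCluster(n_workers=100, threads_per_worker=1, processes=True)
client = Client(cluster)
print(len(client.scheduler_info()["workers"]))  # expect 100 if all started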

There seem to be more timeout issues lately with workers connecting to the scheduler.
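If the connection step is what’s timing out, one knob that sometimes helps (just a suggestion; the value is illustrative) is the connect timeout in the Dask config:

import dask

# Give workers more time to establish their initial connection when
# many of them try to register with the scheduler at once. For CLI
# workers, the equivalent environment variable would be
# DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT="60s".
dask.config.set({"distributed.comm.timeouts.connect": "60s"})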
