One of the SSHCluster workers doesn't pick up tasks

This simple example used to work and the distributed computations used to return the hostnames of both workers.

So what am I doing wrong, or what could have caused it to stop working? The console output isn't very helpful. Is there anywhere else I can look?

I can ssh without any user interaction between these two machines in any combination, so that’s not an issue.

levy@valardohaeris:~$ ipython3
Python 3.10.4 (main, Apr  2 2022, 09:04:19) [GCC 11.2.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.31.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import socket
   ...: 
   ...: from dask import delayed
   ...: from dask.distributed import Client, SSHCluster
   ...: 
   ...: cluster = SSHCluster(["valardohaeris", "valardohaeris", "valarmorghulis"])
   ...: client = Client(cluster)
   ...: delayed_hostnames = list(map(lambda i: delayed(socket.gethostname)(), range(0, 12)))
   ...: delayed_result = delayed(lambda elements: ", ".join(elements))(delayed_hostnames)
   ...: delayed_result.compute()
   ...: 
2022-05-03 12:20:04,471 - distributed.deploy.ssh - INFO - /home/levy/.local/lib/python3.10/site-packages/distributed/cli/dask_spec.py:39: DeprecationWarning: There is no current event loop
2022-05-03 12:20:04,471 - distributed.deploy.ssh - INFO - asyncio.get_event_loop().run_until_complete(run())
2022-05-03 12:20:04,592 - distributed.deploy.ssh - INFO - 2022-05-03 12:20:04,591 - distributed.scheduler - INFO - Clear task state
2022-05-03 12:20:04,592 - distributed.deploy.ssh - INFO - 2022-05-03 12:20:04,592 - distributed.scheduler - INFO -   Scheduler at: tcp://192.168.0.200:43239
2022-05-03 12:20:05,081 - distributed.deploy.ssh - INFO - /home/levy/.local/lib/python3.10/site-packages/distributed/cli/dask_spec.py:39: DeprecationWarning: There is no current event loop
2022-05-03 12:20:05,081 - distributed.deploy.ssh - INFO - asyncio.get_event_loop().run_until_complete(run())
2022-05-03 12:20:05,086 - distributed.deploy.ssh - INFO - 2022-05-03 12:20:05,086 - distributed.nanny - INFO -         Start Nanny at: 'tcp://192.168.0.200:39459'
2022-05-03 12:20:05,397 - distributed.deploy.ssh - INFO - 2022-05-03 12:20:05,396 - distributed.worker - INFO -       Start worker at:  tcp://192.168.0.200:40883
2022-05-03 12:20:05,503 - distributed.deploy.ssh - INFO - /home/levy/.local/lib/python3.10/site-packages/distributed/cli/dask_spec.py:39: DeprecationWarning: There is no current event loop
2022-05-03 12:20:05,503 - distributed.deploy.ssh - INFO - asyncio.get_event_loop().run_until_complete(run())
2022-05-03 12:20:05,725 - distributed.deploy.ssh - INFO - 2022-05-03 12:20:05,700 - distributed.nanny - INFO -         Start Nanny at: 'tcp://192.168.0.52:35643'
2022-05-03 12:20:06,232 - distributed.deploy.ssh - INFO - 2022-05-03 12:20:06,214 - distributed.worker - INFO -       Start worker at:   tcp://192.168.0.52:37591
2022-05-03 12:20:36,566 - distributed.client - WARNING - Couldn't gather 1 keys, rescheduling {'lambda-c1586ef7-d827-4953-ae3a-abcd5c8a2593': ('tcp://192.168.0.52:37591',)}
Out[1]: 'valardohaeris, valardohaeris, valardohaeris, valardohaeris, valardohaeris, valardohaeris, valardohaeris, valardohaeris, valardohaeris, valardohaeris, valardohaeris, valardohaeris'



This turned out to be a firewall problem. The firewall prevented the scheduler from connecting to the worker.

There was no indication of this in the log, though, which is not very user-friendly.
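Since the log does not name the cause, a quick reachability check against the worker addresses it prints can help narrow things down. The following is only a minimal sketch using the standard library; the helper names `can_connect` and `check_workers` are made up for illustration and are not part of Dask. It assumes addresses in the `tcp://host:port` form shown in the log above, and it only tests connectivity from the machine it runs on, so it should be run on the scheduler host to approximate what the scheduler sees.

```python
import socket
from urllib.parse import urlparse


def can_connect(address: str, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to `address` can be opened.

    `address` is expected to look like "tcp://192.168.0.52:37591",
    i.e. the form printed in the "Start worker at:" log lines.
    """
    parsed = urlparse(address)
    try:
        # A firewall that drops packets shows up as a timeout,
        # one that rejects them as a refused connection; both are OSError.
        with socket.create_connection((parsed.hostname, parsed.port), timeout=timeout):
            return True
    except OSError:
        return False


def check_workers(addresses, timeout: float = 3.0) -> dict:
    """Map each worker address to whether it is reachable from this host."""
    return {address: can_connect(address, timeout) for address in addresses}
```

In the session above, something like `check_workers(client.scheduler_info()["workers"])` should list every worker the scheduler knows about together with a reachability flag, assuming the `workers` entry is keyed by worker address. A worker that registered with the scheduler but reports `False` here would point to blocked incoming connections on that machine.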
