Deploy dask docker containers over multiple machines

CryoDrakon · June 30, 2022, 4:10pm

Hello everyone!
I am currently trying to use Dask so that I can run multiple instances on different machines on demand. I am used to working with Docker containers, so I have launched a scheduler on our Synology NAS and four workers on two servers (two workers per machine, using --nworkers 2). I can connect to the scheduler and see information about the workers. I then submit a task, as in the example x = client.submit(inc, 10), but when I call x.result() the worker crashes with the message:

dask-worker | 2022-06-30 15:30:54,139 - distributed.worker - INFO - -------------------------------------------------
dask-worker | 2022-06-30 15:30:54,139 - distributed.core - INFO - Starting established connection
dask-worker | 2022-06-30 15:30:54,141 - distributed.core - INFO - Starting established connection
dask-worker | 2022-06-30 15:33:22,947 - distributed.worker - INFO - Stopping worker at tcp://172.28.0.2:33475
dask-worker | 2022-06-30 15:33:22,954 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-aea8ec1a-3aae-4fb1-a0f4-a3d986cd9590 Address tcp://172.28.0.2:33475 Status: Status.closing
dask-worker | 2022-06-30 15:33:22,956 - distributed.nanny - INFO - Worker closed
dask-worker | 2022-06-30 15:33:22,956 - distributed.nanny - ERROR - Worker process died unexpectedly
dask-worker | 2022-06-30 15:33:23,184 - distributed.nanny - INFO - Closing Nanny at 'tcp://172.28.0.2:41050'.
dask-worker | 2022-06-30 15:33:52,954 - distributed.worker - INFO - Stopping worker at tcp://172.28.0.2:38053
dask-worker | 2022-06-30 15:33:52,960 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-205b67de-2e43-4a8a-82af-4d2dc01e3a4a Address tcp://172.28.0.2:38053 Status: Status.closing
dask-worker | 2022-06-30 15:33:52,962 - distributed.nanny - INFO - Worker closed
dask-worker | 2022-06-30 15:33:52,963 - distributed.nanny - ERROR - Worker process died unexpectedly
dask-worker | 2022-06-30 15:33:53,209 - distributed.nanny - INFO - Closing Nanny at 'tcp://172.28.0.2:33479'.
dask-worker | 2022-06-30 15:33:53,211 - distributed.dask_worker - INFO - End worker

I have been following the example from the site, except that my containers are not on the same network. Is that essential to solving my problem? And would I need to use Docker Swarm to connect the machines, or is there a workaround?

jacobtomlinson · August 17, 2022, 10:17am

Yes, the scheduler and workers need to be on the same network.

Given that the error says "Worker process died unexpectedly", the worker logs may have more information.
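
One way to check the network point is to test whether the address a worker advertises can be reached from another machine. The worker in the log above reports tcp://172.28.0.2:33475, which is in a private range and looks like a container-internal address; if so, it would likely not be reachable from the scheduler host. Below is a minimal sketch using only the Python standard library. The helper name can_reach is hypothetical and is not part of Dask.

```python
import socket
from urllib.parse import urlparse


def can_reach(address: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to a tcp://host:port address succeeds.

    Hypothetical helper for illustration; it is not part of Dask.
    """
    parsed = urlparse(address)
    if parsed.hostname is None or parsed.port is None:
        raise ValueError(f"expected tcp://host:port, got {address!r}")
    try:
        # Open and immediately close a plain TCP connection.
        with socket.create_connection((parsed.hostname, parsed.port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unroutable: treat all as unreachable.
        return False
```

Running can_reach("tcp://172.28.0.2:33475") on the scheduler host, with the address copied from the worker log, should show whether the scheduler can open a connection back to that worker. A False result would be consistent with the containers sitting on separate networks, though it does not rule out other causes.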

wh1t3rabit · August 2, 2023, 7:49am

In most cases, you also need a shared volume mounted on all workers in the same network, in order to store temp files during processing.

jacobtomlinson · August 2, 2023, 8:34am

Workers do not need a shared volume for their temp files. But the more performant the filesystem, the better.
