Hey team! I have a quick question regarding some weird behaviour when I create a distributed cluster over some machines on a LAN. When I start a worker alongside the scheduler on one machine (the master in this instance), workers tend to time out often and nothing gets computed. When I kill that worker (so the scheduler is now running solo on the master), everything works perfectly… Is that normal? Is it mentioned somewhere and I missed it? If it is a bug I will look into posting on GitHub (just wanted to make sure it is not something too trivial). Thanks a ton!
@giorgostheo Welcome to Discourse!
Is that normal? Is it mentioned somewhere and I missed it?
That does seem odd, and we’d need a little more information to say what’s going on. Would you be able to share the timeout error traceback, and describe how you’re setting up your distributed cluster?
Sadly I do not have extra info, since I took the cluster down due to the many errors I encountered… Dask was probably not at fault; I suspect it is a cloud provider problem (it's a local academic one, and you cannot imagine how terribly it's maintained).
All I know is that when I took down the worker that ran alongside the scheduler, the cluster stopped hanging… sorry I can't give more detail. Before that, there were constant timeouts from random workers that lost their connection when a computation started.
Reproducing should be easy though, if someone wants to take a look: just spin up a scheduler and a worker on the same machine (I used the dask-scheduler and dask-worker commands), as well as some workers on other machines on the LAN.
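In case it helps whoever tries to reproduce, the setup described above looks roughly like this (the `<master-ip>` placeholder is an assumption for whatever address the master machine has; 8786 is the scheduler's default port):

```shell
# On the master machine: start the scheduler
# (listens on tcp://<master-ip>:8786 by default)
dask-scheduler

# Still on the master: start a worker alongside the scheduler,
# pointing it at the local scheduler address
dask-worker tcp://<master-ip>:8786

# On each of the other LAN machines: start a worker
# pointing at the master's scheduler
dask-worker tcp://<master-ip>:8786
```

Killing the first dask-worker process on the master (leaving the scheduler running solo there) is what made the hanging stop in my case.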
Again, I am 99% sure it's not Dask's fault, so maybe further investigation is unnecessary.